I recently decided to pursue the Imperial Business Analytics Programme at Imperial College Business School. I chose a final project topic that is relevant now that everyone is trying to move on with their lives without leaving home: internet user experience and data networks. COVID-19 has created unprecedented challenges for data networks. Just as in many other countries, the Colombian government imposed a strict lockdown in response to the COVID-19 emergency. In a very short time, and without preparation, Colombian society had to adapt so that a significant part of the economy and society could still function without leaving home. This adaptation required an accelerated adoption of digital tools:
Many companies were forced to adapt so that their employees could work from home. Schools were forced to close and offer remote learning. There was an unprecedented surge in streaming and internet-based entertainment services. This adoption of digital and remote tools produced an unprecedented, unforeseen increase in data traffic. Networks became more critical than ever so that society could function at least to some degree.
Network operators were forced to cope with this traffic spike without proper preparation. Capacity expansion processes normally ran in yearly cycles, using the previous 12-month traffic window as an input to predict near-term needs. Many people believe that operators enjoyed a boost in revenue and profits because of the increase in traffic, as experienced by internet-based companies such as Amazon or Zoom. However, operators' public results have shown the opposite: their revenue and cash-in decreased because of a rise in non-payment, overdue accounts, and cancellations. This financial deterioration further constrained operators' capacity to expand their networks to match the traffic increase.
My project analyzed Colombia (my home country) to understand whether this traffic increase has resulted in a deterioration of user experience, and to identify the municipalities that have suffered the most, so that public policy can be adapted to support investments in network infrastructure in those municipalities.
My project assessed how the unexpected traffic surge due to COVID-19 impacted internet experience KPIs (average downlink speed, average uplink speed, latency) in all municipalities in Colombia compared to the pre-COVID-19 scenario. The assessment analyzed the time series of aggregated traffic in Colombia to confirm the historical rise in traffic, assessed the impact of the surge (possibly caused by network saturation), classified municipalities by how severely user experience was affected, and forecasted possible traffic trends after the economy re-opened. The results could be used by public policy makers to better define where incentives for network infrastructure investments should be prioritized, and which technologies should be incentivized.
The Imperial Analytics Programme provides an overview of statistical and machine learning tools that can be used to solve problems and support decisions. Most of the programming in the course is done in Python using Anaconda. I first tried to run the necessary transformations on the dataset using Anaconda on my local machine. After several memory crashes I realized that I would require a cloud platform. I had previously experimented with IBM Cloud while pursuing my IBM Data Science Certification: IBM Cloud is a practical, intuitive platform that is easy to use even for beginners. The free tier provides an environment with up to 50 capacity units per month, where a 4 vCPU + 16 GB RAM environment consumes 2 capacity units per hour (roughly 25 hours of runtime per month). However, I estimated that I would need more than 16 GB of RAM to run my code, so I decided to try a different platform that I have been using in my current job: Google Cloud Platform. GCP lets any user set up an account and start a 90-day trial with 300 USD in credits. It is very simple to set up a standard instance with 4 vCPUs and 32 GB of RAM, which was sufficient computing power for my project.
Additionally, I had the opportunity to experiment with various Python libraries, including: GeoPandas, a powerful tool for performing joins and transformations on geo-referenced data; mapclassify, for plotting geo-referenced data in choropleth maps; the Seaborn and Matplotlib data visualization tools; and several of the classifier and clustering tools in the scikit-learn library.
I was amazed by the vast amount of public data available for my project. I was surprised to see that Ookla has made its Speedtest dataset available to the public, containing aggregated and anonymized geo-referenced data on downlink speed, uplink speed, and latency. I was also happy to see that the Colombian government's initiative to consolidate a repository of public data, datos.gov.co, has advanced and now includes many datasets from the Ministry of Telecommunications. I determined that these data sources, which are completely public and available to anyone, would be sufficient to perform the analysis as originally envisioned. I specifically included the following sources of data:
As supported by the analysis below, I indeed reached some interesting conclusions:
import matplotlib.pyplot as plt
import geopandas as gpd
import geoplot as gpl
import mapclassify as mpclas
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn import svm
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
# Speed test fixed Q1 dataset:
st_fixed_q1 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/fixed_q1_20/gps_fixed_tiles.shp')
print("Speed test fixed Q1 data loaded Shape:", st_fixed_q1.shape)
display(st_fixed_q1.head())
# Speed test fixed Q2 dataset:
st_fixed_q2 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/fixed_q2_20/gps_fixed_tiles.shp')
print("Speed test fixed Q2 data loaded Shape:", st_fixed_q2.shape)
display(st_fixed_q2.head())
# Speed test fixed Q3 dataset:
st_fixed_q3 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/fixed_q3_20/gps_fixed_tiles.shp')
print("Speed test fixed Q3 data loaded Shape:", st_fixed_q3.shape)
display(st_fixed_q3.head())
# Speed test mobile Q1 dataset:
st_mob_q1 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/mobile_q1_20/gps_mobile_tiles.shp')
print("Speed test mobile Q1 data loaded Shape:", st_mob_q1.shape)
display(st_mob_q1.head())
# Speed test mobile Q2 dataset:
st_mob_q2 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/mobile_q2_20/gps_mobile_tiles.shp')
print("Speed test mobile Q2 data loaded Shape:", st_mob_q2.shape)
display(st_mob_q2.head())
# Speed test mobile Q3 dataset:
st_mob_q3 = gpd.read_file('gs://imperial_analytics_project/sources/1_speed_test_titles/mobile_q3_20/gps_mobile_tiles.shp')
print("Speed test mobile Q3 data loaded Shape:", st_mob_q3.shape)
display(st_mob_q3.head())
#Show all columns
pd.set_option('display.max_columns', None)
#Municipality SHP files
mun_shp = gpd.read_file('gs://imperial_analytics_project/sources/2_dane_shp_mun/MGN_MPIO_POLITICO.shp')
print("Municipalities geo-shape data loaded Shape:", mun_shp.shape)
display(mun_shp.head(2))
#Aggregated traffic per day per operator
agg_traf_op = pd.read_csv('gs://imperial_analytics_project/sources/3_mintic_agg_traff_day/agg_traf_apr_dec.csv')
agg_traf_base = pd.read_csv('gs://imperial_analytics_project/sources/3_mintic_agg_traff_day/agg_traf_base.csv')
print("Aggregated traffic per day per operator data loaded Shape:", agg_traf_op.shape)
display(agg_traf_op.head(2))
print("Aggregated traffic per day per operator base loaded Shape:", agg_traf_base.shape)
display(agg_traf_base.head(2))
#Fixed Access Penetration by municipality
fixed_acc_pen = pd.read_csv('gs://imperial_analytics_project/sources/4_mintic_fixed_mobile_by_mun/fixed_acc_by_mun_pen_v4.csv')
print("Fixed access penetration by municipality data loaded Shape:", fixed_acc_pen.shape)
display(fixed_acc_pen.head(2))
#Fixed Access Number of Accesses by Type by municipality
fixed_acc_type = pd.read_csv('gs://imperial_analytics_project/sources/4_mintic_fixed_mobile_by_mun/fixed_acc_by_mun_type_v4.csv')
print("Fixed Access Number of Accesses by Type by municipality data loaded Shape:", fixed_acc_type.shape)
display(fixed_acc_type.head(2))
#Mobile Access Coverage by Technology by municipality
mobile_cov_tech = pd.read_csv('gs://imperial_analytics_project/sources/4_mintic_fixed_mobile_by_mun/mobile_cov_per_tech_by_mun_v4.csv')
print("Mobile Access Coverage by Technology by municipality data loaded Shape:", mobile_cov_tech.shape)
display(mobile_cov_tech.head(2))
#DANE GDP by municipality
gdp_mun = pd.read_csv('gs://imperial_analytics_project/sources/5_dane_gdp/dane_gdp_per_mun.csv')
print("GDP by municipality data loaded Shape:", gdp_mun.shape)
display(gdp_mun.head(2))
The analysis requires all datasets to be joined into a common, geo-referenced dataset that can be used to compare municipalities across quarters. I followed these steps to prepare the data:
As a result of the dataset transformation, two geo-located dataframes (SHP files) are generated:
I used the GeoPandas sjoin function to assign speed test measurements to Colombian municipalities. sjoin is a powerful function: it effectively joined the tables and assigned around 30k tiles (out of more than 6 million) to the corresponding municipality in Colombia. Its drawback is that it is rather slow, as it is not the most efficient way to process large geo-referenced datasets. If you are interested in optimizing geo-referenced processing, check this blog post by dask.org (https://blog.dask.org/2017/09/21/accelerating-geopandas-1). It took a while, but GeoPandas managed to do the work and generate six dataframes with municipality SHP geometry (one dataframe per quarter, for mobile and fixed data).
# Fixed Q1 sjoin
sjoin_fix_q1 = gpd.sjoin(mun_shp, st_fixed_q1, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test fixed Q1 sjoin with municipality SHP files data loaded Shape:", sjoin_fix_q1.shape)
display(sjoin_fix_q1.head(2))
# Fixed Q2 sjoin
sjoin_fix_q2 = gpd.sjoin(mun_shp, st_fixed_q2, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test fixed Q2 sjoin with municipality SHP files data loaded Shape:", sjoin_fix_q2.shape)
display(sjoin_fix_q2.head(2))
# Fixed Q3 sjoin
sjoin_fix_q3 = gpd.sjoin(mun_shp, st_fixed_q3, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test fixed Q3 sjoin with municipality SHP files data loaded Shape:", sjoin_fix_q3.shape)
display(sjoin_fix_q3.head(2))
# Mobile Q1 sjoin
sjoin_mob_q1 = gpd.sjoin(mun_shp, st_mob_q1, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test mobile Q1 sjoin with municipality SHP files data loaded Shape:", sjoin_mob_q1.shape)
display(sjoin_mob_q1.head(2))
# Mobile Q2 sjoin
sjoin_mob_q2 = gpd.sjoin(mun_shp, st_mob_q2, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test mobile Q2 sjoin with municipality SHP files data loaded Shape:", sjoin_mob_q2.shape)
display(sjoin_mob_q2.head(2))
# Mobile Q3 sjoin
sjoin_mob_q3 = gpd.sjoin(mun_shp, st_mob_q3, how='inner', op='intersects', lsuffix='left', rsuffix='right')
print("Speed test mobile Q3 sjoin with municipality SHP files data loaded Shape:", sjoin_mob_q3.shape)
display(sjoin_mob_q3.head(2))
I first replicate the original test distribution by repeating each measurement according to its number of tests in each quarter's dataset, so that the municipality averages are test-weighted. I then perform a groupby using the concatenation of the department code and the municipality code as the groupby key.
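To see why the repeat step works, here is a toy, pandas-only demonstration (the numbers are made up): repeating each tile row by its `tests` count before averaging turns a plain mean into a test-weighted mean.

```python
import pandas as pd

# Two tiles in one municipality: 100 kbps measured over 9 tests, 10 kbps over 1 test
tiles = pd.DataFrame({'DPTO_MPIO': ['05_001', '05_001'],
                      'avg_d_kbps': [100.0, 10.0],
                      'tests': [9, 1]})

# An unweighted mean would give 55.0; repeating rows by 'tests' weights each tile
repeated = tiles.loc[tiles.index.repeat(tiles['tests'])]
weighted = repeated.groupby('DPTO_MPIO')['avg_d_kbps'].mean()
print(weighted.loc['05_001'])  # 91.0 = (9*100 + 1*10) / 10
```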
# Create unique key for groupby
sjoin_fix_q1['DPTO_MPIO'] = sjoin_fix_q1['DPTO_CCDGO'] + "_" + sjoin_fix_q1['MPIO_CCDGO']
sjoin_fix_q1r = sjoin_fix_q1.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_fix_q1r = sjoin_fix_q1r.loc[sjoin_fix_q1r.index.repeat(sjoin_fix_q1r.tests)]
print("Fixed Q1 sjoin repeat by municipality generated Shape:", sjoin_fix_q1r.shape)
display(sjoin_fix_q1r.head(2))
sjoin_fix_q1r_avg = sjoin_fix_q1r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q1', 'avg_u_kbps': 'avg_u_kbps_q1', 'avg_lat_ms': 'avg_lat_ms_q1'})
sjoin_fix_q1r_std = sjoin_fix_q1r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q1', 'avg_u_kbps': 'std_u_kbps_q1', 'avg_lat_ms': 'std_lat_ms_q1'})
sjoin_fix_q1_sum = sjoin_fix_q1[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q1', 'devices': 'devices_q1'})
sjoin_fix_q1r_group = sjoin_fix_q1r_avg.join(sjoin_fix_q1r_std)
sjoin_fix_q1r_group = sjoin_fix_q1r_group.join(sjoin_fix_q1_sum)
print("Fixed Q1 groupby by municipality generated Shape:", sjoin_fix_q1r_group.shape)
display(sjoin_fix_q1r_group.head(2))
# Create unique key for groupby
sjoin_fix_q2['DPTO_MPIO'] = sjoin_fix_q2['DPTO_CCDGO'] + "_" + sjoin_fix_q2['MPIO_CCDGO']
sjoin_fix_q2r = sjoin_fix_q2.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_fix_q2r = sjoin_fix_q2r.loc[sjoin_fix_q2r.index.repeat(sjoin_fix_q2r.tests)]
print("Fixed Q2 sjoin repeat by municipality generated Shape:", sjoin_fix_q2r.shape)
display(sjoin_fix_q2r.head(2))
sjoin_fix_q2r_avg = sjoin_fix_q2r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q2', 'avg_u_kbps': 'avg_u_kbps_q2', 'avg_lat_ms': 'avg_lat_ms_q2'})
sjoin_fix_q2r_std = sjoin_fix_q2r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q2', 'avg_u_kbps': 'std_u_kbps_q2', 'avg_lat_ms': 'std_lat_ms_q2'})
sjoin_fix_q2_sum = sjoin_fix_q2[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q2', 'devices': 'devices_q2'})
sjoin_fix_q2r_group = sjoin_fix_q2r_avg.join(sjoin_fix_q2r_std)
sjoin_fix_q2r_group = sjoin_fix_q2r_group.join(sjoin_fix_q2_sum)
print("Fixed Q2 groupby by municipality generated Shape:", sjoin_fix_q2r_group.shape)
display(sjoin_fix_q2r_group.head(2))
# Create unique key for groupby
sjoin_fix_q3['DPTO_MPIO'] = sjoin_fix_q3['DPTO_CCDGO'] + "_" + sjoin_fix_q3['MPIO_CCDGO']
sjoin_fix_q3r = sjoin_fix_q3.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_fix_q3r = sjoin_fix_q3r.loc[sjoin_fix_q3r.index.repeat(sjoin_fix_q3r.tests)]
print("Fixed Q3 sjoin repeat by municipality generated Shape:", sjoin_fix_q3r.shape)
display(sjoin_fix_q3r.head(2))
sjoin_fix_q3r_avg = sjoin_fix_q3r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q3', 'avg_u_kbps': 'avg_u_kbps_q3', 'avg_lat_ms': 'avg_lat_ms_q3'})
sjoin_fix_q3r_std = sjoin_fix_q3r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q3', 'avg_u_kbps': 'std_u_kbps_q3', 'avg_lat_ms': 'std_lat_ms_q3'})
sjoin_fix_q3_sum = sjoin_fix_q3[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q3', 'devices': 'devices_q3'})
sjoin_fix_q3r_group = sjoin_fix_q3r_avg.join(sjoin_fix_q3r_std)
sjoin_fix_q3r_group = sjoin_fix_q3r_group.join(sjoin_fix_q3_sum)
print("Fixed Q3 groupby by municipality generated Shape:", sjoin_fix_q3r_group.shape)
display(sjoin_fix_q3r_group.head(2))
# Create unique key for groupby
sjoin_mob_q1['DPTO_MPIO'] = sjoin_mob_q1['DPTO_CCDGO'] + "_" + sjoin_mob_q1['MPIO_CCDGO']
sjoin_mob_q1r = sjoin_mob_q1.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_mob_q1r = sjoin_mob_q1r.loc[sjoin_mob_q1r.index.repeat(sjoin_mob_q1r.tests)]
print("Mobile Q1 sjoin repeat by municipality generated Shape:", sjoin_mob_q1r.shape)
display(sjoin_mob_q1r.head(2))
sjoin_mob_q1r_avg = sjoin_mob_q1r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q1', 'avg_u_kbps': 'avg_u_kbps_q1', 'avg_lat_ms': 'avg_lat_ms_q1'})
sjoin_mob_q1r_std = sjoin_mob_q1r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q1', 'avg_u_kbps': 'std_u_kbps_q1', 'avg_lat_ms': 'std_lat_ms_q1'})
sjoin_mob_q1_sum = sjoin_mob_q1[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q1', 'devices': 'devices_q1'})
sjoin_mob_q1r_group = sjoin_mob_q1r_avg.join(sjoin_mob_q1r_std)
sjoin_mob_q1r_group = sjoin_mob_q1r_group.join(sjoin_mob_q1_sum)
print("Mobile Q1 groupby by municipality generated Shape:", sjoin_mob_q1r_group.shape)
display(sjoin_mob_q1r_group.head(2))
# Create unique key for groupby
sjoin_mob_q2['DPTO_MPIO'] = sjoin_mob_q2['DPTO_CCDGO'] + "_" + sjoin_mob_q2['MPIO_CCDGO']
sjoin_mob_q2r = sjoin_mob_q2.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_mob_q2r = sjoin_mob_q2r.loc[sjoin_mob_q2r.index.repeat(sjoin_mob_q2r.tests)]
print("Mobile Q2 sjoin repeat by municipality generated Shape:", sjoin_mob_q2r.shape)
display(sjoin_mob_q2r.head(2))
sjoin_mob_q2r_avg = sjoin_mob_q2r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q2', 'avg_u_kbps': 'avg_u_kbps_q2', 'avg_lat_ms': 'avg_lat_ms_q2'})
sjoin_mob_q2r_std = sjoin_mob_q2r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q2', 'avg_u_kbps': 'std_u_kbps_q2', 'avg_lat_ms': 'std_lat_ms_q2'})
sjoin_mob_q2_sum = sjoin_mob_q2[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q2', 'devices': 'devices_q2'})
sjoin_mob_q2r_group = sjoin_mob_q2r_avg.join(sjoin_mob_q2r_std)
sjoin_mob_q2r_group = sjoin_mob_q2r_group.join(sjoin_mob_q2_sum)
print("Mobile Q2 groupby by municipality generated Shape:", sjoin_mob_q2r_group.shape)
display(sjoin_mob_q2r_group.head(2))
# Create unique key for groupby
sjoin_mob_q3['DPTO_MPIO'] = sjoin_mob_q3['DPTO_CCDGO'] + "_" + sjoin_mob_q3['MPIO_CCDGO']
sjoin_mob_q3r = sjoin_mob_q3.sort_values(by = 'quadkey').reset_index(drop=True)
# Repeat each tile row by its number of tests so averages are test-weighted
sjoin_mob_q3r = sjoin_mob_q3r.loc[sjoin_mob_q3r.index.repeat(sjoin_mob_q3r.tests)]
print("Mobile Q3 sjoin repeat by municipality generated Shape:", sjoin_mob_q3r.shape)
display(sjoin_mob_q3r.head(2))
sjoin_mob_q3r_avg = sjoin_mob_q3r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(['DPTO_MPIO']).mean().rename(columns={'avg_d_kbps': 'avg_d_kbps_q3', 'avg_u_kbps': 'avg_u_kbps_q3', 'avg_lat_ms': 'avg_lat_ms_q3'})
sjoin_mob_q3r_std = sjoin_mob_q3r[['DPTO_MPIO', 'avg_d_kbps', 'avg_u_kbps', 'avg_lat_ms']].groupby(by = ['DPTO_MPIO']).std().rename(columns={'avg_d_kbps': 'std_d_kbps_q3', 'avg_u_kbps': 'std_u_kbps_q3', 'avg_lat_ms': 'std_lat_ms_q3'})
sjoin_mob_q3_sum = sjoin_mob_q3[['DPTO_MPIO','tests', 'devices']].groupby(['DPTO_MPIO']).sum().rename(columns={'tests': 'tests_q3', 'devices': 'devices_q3'})
sjoin_mob_q3r_group = sjoin_mob_q3r_avg.join(sjoin_mob_q3r_std)
sjoin_mob_q3r_group = sjoin_mob_q3r_group.join(sjoin_mob_q3_sum)
print("Mobile Q3 groupby by municipality generated Shape:", sjoin_mob_q3r_group.shape)
display(sjoin_mob_q3r_group.head(2))
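The six near-identical aggregation cells above differ only in the input dataframe and the quarter suffix, so they could be factored into a single helper. A pandas-only sketch with toy data (the helper name and the reduced column set are illustrative, not from the notebook):

```python
import pandas as pd

def group_by_municipality(df, q):
    """Test-weighted KPI means plus test totals per municipality, columns suffixed by quarter."""
    df = df.copy()
    df['DPTO_MPIO'] = df['DPTO_CCDGO'] + "_" + df['MPIO_CCDGO']
    # Repeat each tile row by its number of tests so the mean is test-weighted
    rep = df.loc[df.index.repeat(df['tests'])]
    avg = rep.groupby('DPTO_MPIO')[['avg_d_kbps']].mean().add_suffix(f'_{q}')
    tot = df.groupby('DPTO_MPIO')[['tests']].sum().add_suffix(f'_{q}')
    return avg.join(tot)

# Toy quarters standing in for the real sjoin outputs
q1 = pd.DataFrame({'DPTO_CCDGO': ['05'], 'MPIO_CCDGO': ['001'], 'avg_d_kbps': [100.0], 'tests': [2]})
q2 = pd.DataFrame({'DPTO_CCDGO': ['05'], 'MPIO_CCDGO': ['001'], 'avg_d_kbps': [80.0], 'tests': [3]})
grouped = {q: group_by_municipality(df, q) for q, df in {'q1': q1, 'q2': q2}.items()}
consolidated = grouped['q1'].join(grouped['q2'], how='inner')
```

The same loop would then produce the consolidated per-quarter tables with far less repeated code.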
# Fixed
fixed_consolidated = sjoin_fix_q1r_group.join(sjoin_fix_q2r_group, how = "left")
fixed_consolidated = fixed_consolidated.join(sjoin_fix_q3r_group, how = "inner")
print("Fixed consolidated generated Shape:", fixed_consolidated.shape)
display(fixed_consolidated.head())
display(fixed_consolidated.describe())
#Mobile
mobile_consolidated = sjoin_mob_q1r_group.join(sjoin_mob_q2r_group, how = "left")
mobile_consolidated = mobile_consolidated.join(sjoin_mob_q3r_group, how = "inner")
print("Mobile consolidated generated Shape:", mobile_consolidated.shape)
display(mobile_consolidated.head())
display(mobile_consolidated.describe())
#Backup and load consolidated data
#fixed_consolidated.to_csv('gs://imperial_analytics_project/backup/fixed_consolidated/fixed_consolidated.csv')
#sjoin_fix_q1r_group.to_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q1r_group.csv')
#sjoin_fix_q2r_group.to_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q2r_group.csv')
#sjoin_fix_q3r_group.to_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q3r_group.csv')
fixed_consolidated = pd.read_csv('gs://imperial_analytics_project/backup/fixed_consolidated/fixed_consolidated.csv', index_col='DPTO_MPIO')
#sjoin_fix_q1r_group = pd.read_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q1r_group.csv', index_col='DPTO_MPIO')
#sjoin_fix_q2r_group = pd.read_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q2r_group.csv', index_col='DPTO_MPIO')
#sjoin_fix_q3r_group = pd.read_csv('gs://imperial_analytics_project/backup/fixed_consolidated/sjoin_fix_q3r_group.csv', index_col='DPTO_MPIO')
#print("Fixed Q1 loaded Shape:", sjoin_fix_q1r_group.shape)
#print("Fixed Q2 loaded Shape:", sjoin_fix_q2r_group.shape)
#print("Fixed Q3 loaded Shape:", sjoin_fix_q3r_group.shape)
#display(sjoin_fix_q1r_group.head(2))
print("Fixed consolidated generated Shape:", fixed_consolidated.shape)
display(fixed_consolidated.head(2))
#Backup and load consolidated data
#mobile_consolidated.to_csv('gs://imperial_analytics_project/backup/mobile_consolidated/mobile_consolidated.csv')
#sjoin_mob_q1r_group.to_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q1r_group.csv')
#sjoin_mob_q2r_group.to_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q2r_group.csv')
#sjoin_mob_q3r_group.to_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q3r_group.csv')
mobile_consolidated = pd.read_csv('gs://imperial_analytics_project/backup/mobile_consolidated/mobile_consolidated.csv', index_col='DPTO_MPIO')
#sjoin_mob_q1r_group = pd.read_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q1r_group.csv', index_col='DPTO_MPIO')
#sjoin_mob_q2r_group = pd.read_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q2r_group.csv', index_col='DPTO_MPIO')
#sjoin_mob_q3r_group = pd.read_csv('gs://imperial_analytics_project/backup/mobile_consolidated/sjoin_mob_q3r_group.csv', index_col='DPTO_MPIO')
#print("Mobile Q1 loaded Shape:", sjoin_mob_q1r_group.shape)
#print("Mobile Q2 loaded Shape:", sjoin_mob_q2r_group.shape)
#print("Mobile Q3 loaded Shape:", sjoin_mob_q3r_group.shape)
#display(sjoin_mob_q1r_group.head(2))
print("Mobile consolidated generated Shape:", mobile_consolidated.shape)
display(mobile_consolidated.head(2))
#Join key for municipalities dataset
mun_shp_join = mun_shp.drop_duplicates()
mun_shp_join['DPTO_MPIO'] = mun_shp_join['DPTO_CCDGO'] + "_" + mun_shp_join['MPIO_CCDGO']
mun_shp_join = mun_shp_join.set_index('DPTO_MPIO')
mun_shp_join['MPIO_CCDGO'] = mun_shp_join['MPIO_CCDGO'].astype('int64')
mun_shp_fix = mun_shp_join[mun_shp_join.index.isin(fixed_consolidated.index)]
fixed_consolidated_geo = mun_shp_fix.join(fixed_consolidated, how = 'inner')
fixed_consolidated_geo.reset_index(inplace = True)
print("Fixed geo & consolidated generated Shape:", fixed_consolidated_geo.shape)
display(fixed_consolidated_geo.head(2))
mun_shp_mob = mun_shp_join[mun_shp_join.index.isin(mobile_consolidated.index)]
mobile_consolidated_geo = mun_shp_mob.join(mobile_consolidated, how = 'inner')
mobile_consolidated_geo.reset_index(inplace = True)
print("Mobile geo & consolidated generated Shape:", mobile_consolidated_geo.shape)
display(mobile_consolidated_geo.head(2))
# Prepare data
#Prepare GDP data
gdp_mun['GDP'] = gdp_mun['GDP'].str.replace(',','').astype('float')
print("GDP per municipality table:")
display(gdp_mun.head(2))
# Fixed access penetration and population
fixed_acc_pen.rename(columns = {'No. ACCESOS FIJOS A INTERNET': 'NUM_INTERNET_ACCESS', 'POBLACIÓN DANE' : 'POPULATION', 'INDICE':'FIX_INT_PENETRATION'}, inplace = True)
fixed_acc_pen['FIX_INT_PENETRATION'] = fixed_acc_pen['FIX_INT_PENETRATION'].str.replace(',','.').astype('float')
fix_acc_pen_q1 = fixed_acc_pen.loc[((fixed_acc_pen['AÑO'] == 2020) & (fixed_acc_pen['TRIMESTRE'] == 1))][['NUM_INTERNET_ACCESS', 'POPULATION', 'FIX_INT_PENETRATION', 'DPTO_MPIO']].sort_values(by=['DPTO_MPIO'])
fix_acc_pen_q3 = fixed_acc_pen.loc[((fixed_acc_pen['AÑO'] == 2020) & (fixed_acc_pen['TRIMESTRE'] == 2))][['NUM_INTERNET_ACCESS', 'POPULATION', 'FIX_INT_PENETRATION', 'DPTO_MPIO']].sort_values(by=['DPTO_MPIO'])
print("Fixed Penetration table Q1 and Q3:")
display(fix_acc_pen_q1.head(1))
display(fix_acc_pen_q3.head(1))
#Fixed access type
fixed_acc_type.rename(columns = {'TECNOLOGIA' : 'TECHNOLOGY', 'VELOCIDAD_BAJADA': 'AVG_OFFER_DL_FIX_ACC', 'VELOCIDAD_SUBIDA' : 'AVG_OFFER_UL_FIX_ACC', 'No DE ACCESOS':'NBR_TECH_ACCESS'} , inplace = True)
#One hot type of technology
onehot_fix_tech = pd.get_dummies(fixed_acc_type[['TECHNOLOGY']], prefix = "", prefix_sep="").mul(fixed_acc_type['NBR_TECH_ACCESS'],0)
onehot_fix_group= pd.DataFrame()
onehot_fix_group['FIX_TECH_FTTH'] = onehot_fix_tech['FIBER TO THE HOME (FTTH)']
onehot_fix_group['FIX_TECH_FTTX'] = onehot_fix_tech['FIBER TO THE X (FTTX)'] + onehot_fix_tech['FIBER TO THE CABINET (FTTC)'] + onehot_fix_tech['FIBER TO THE BUILDING O FIBER'] + onehot_fix_tech['FIBER TO THE PREMISES']+ onehot_fix_tech['FIBER TO THE NODE (FTTN)']
onehot_fix_group['FIX_TECH_CABLE'] = onehot_fix_tech['CABLE'] + onehot_fix_tech['HYBRID FIBER COAXIAL (HFC)']
onehot_fix_group['FIX_TECH_COPPER'] = onehot_fix_tech['XDSL'] + onehot_fix_tech['OTRAS TECNOLOGÍAS FIJAS']
onehot_fix_group['FIX_TECH_FWA'] = onehot_fix_tech['OTRAS TECNOLOGÍAS INALÁMBRICA'] + onehot_fix_tech['WIFI'] + onehot_fix_tech['SATELITAL'] + onehot_fix_tech['WIMAX'] + onehot_fix_tech['FIBER TO THE ANTENNA (FTTA)']
fixed_acc_type_o = fixed_acc_type.join(onehot_fix_group)
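The `get_dummies(...).mul(...)` pattern used above is worth unpacking: one-hot encoding the technology column and multiplying by the access count spreads each row's accesses into its own technology column, so a later groupby-sum yields accesses per technology. A toy demonstration (made-up rows):

```python
import pandas as pd

df = pd.DataFrame({'TECHNOLOGY': ['CABLE', 'XDSL', 'CABLE'],
                   'NBR_TECH_ACCESS': [5, 2, 3]})

# One column per technology, each row holding that row's access count (0 elsewhere)
onehot = pd.get_dummies(df[['TECHNOLOGY']], prefix="", prefix_sep="").mul(df['NBR_TECH_ACCESS'], 0)
totals = onehot.sum()  # totals['CABLE'] == 8, totals['XDSL'] == 2
```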
#Onehot provider
onehot_fix_prov = pd.get_dummies(fixed_acc_type[['PROVEEDOR']], prefix = "", prefix_sep="").mul(fixed_acc_type['NBR_TECH_ACCESS'],0)
onehot_fix_prov_group= pd.DataFrame()
onehot_fix_prov_group['CLARO'] = onehot_fix_prov['COMUNICACION CELULAR S A COMCEL S A']
onehot_fix_prov_group['TIGO'] = onehot_fix_prov['UNE EPM TELECOMUNICACIONES S.A.']
onehot_fix_prov_group['MOVISTAR'] = onehot_fix_prov['COLOMBIA TELECOMUNICACIONES S.A. E.S.P.']
onehot_fix_prov_group['ETB'] = onehot_fix_prov['EMPRESA DE TELECOMUNICACIONES DE BOGOTA S.A. ESP']
onehot_fix_prov_group['EMCALI'] = onehot_fix_prov['EMPRESAS MUNICIPALES DE CALI EICE E.S.P']
onehot_fix_prov_group['DIRECTV'] = onehot_fix_prov['DIRECTV COLOMBIA LTDA']
onehot_fix_prov_group['EDATEL'] = onehot_fix_prov['EDATEL S.A.']
onehot_fix_prov_group['HV_MULT'] = onehot_fix_prov['H V TELEVISION S.A.S.']
onehot_fix_prov_group['TELEBMGA'] = onehot_fix_prov['EMPRESA DE TELECOMUNICACIONES DE BUCARAMANGA S.A. E.S.P.']
onehot_fix_prov_group['METROTEL'] = onehot_fix_prov['METROTEL S.A']
onehot_fix_prov_group['HUGHES'] = onehot_fix_prov['HUGHES DE COLOMBIA S.A.S.']
onehot_fix_prov_group['AZTECA'] = onehot_fix_prov['AZTECA COMUNICACIONES COLOMBIA S.A.S']
fixed_acc_type_o = fixed_acc_type_o.join(onehot_fix_prov_group)
fixed_acc_type_o['AVG_OFFER_DL_FIX_ACC'] = np.where(fixed_acc_type_o['AVG_OFFER_DL_FIX_ACC'] >100 , #Identifies the case to apply to
100, #This is the value that is inserted
fixed_acc_type_o['AVG_OFFER_DL_FIX_ACC'])
fixed_acc_type_o['AVG_OFFER_UL_FIX_ACC'] = np.where(fixed_acc_type_o['AVG_OFFER_UL_FIX_ACC'] >100 , #Identifies the case to apply to
100, #This is the value that is inserted
fixed_acc_type_o['AVG_OFFER_UL_FIX_ACC'])
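The two `np.where` calls above cap the advertised offer speeds at 100 to tame outliers; pandas' `Series.clip` expresses the same capping in a single call, e.g.:

```python
import pandas as pd

speeds = pd.Series([20.0, 150.0, 1000.0])
capped = speeds.clip(upper=100)  # values above 100 become 100; the rest are unchanged
```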
fix_acc_type_q1 = fixed_acc_type_o.loc[((fixed_acc_type_o['AÑO'] == 2020) & (fixed_acc_type_o['TRIMESTRE'] == 1))].drop(['AÑO', 'TRIMESTRE','PROVEEDOR', 'DEPARTAMENTO', 'MUNICIPIO', 'SEGMENTO', 'TECHNOLOGY'], axis= 1)
fix_acc_type_q1_sum = fix_acc_type_q1[['DPTO_MPIO','NBR_TECH_ACCESS', 'CLARO', 'TIGO', 'MOVISTAR', 'ETB', 'EMCALI','DIRECTV','EDATEL','HV_MULT','TELEBMGA', 'METROTEL', 'HUGHES','AZTECA','FIX_TECH_FTTH', 'FIX_TECH_FTTX', 'FIX_TECH_CABLE', 'FIX_TECH_COPPER', 'FIX_TECH_FWA']].groupby(by = ['DPTO_MPIO']).sum()
fix_acc_type_q1_avg = fix_acc_type_q1[['DPTO_MPIO', 'AVG_OFFER_DL_FIX_ACC', 'AVG_OFFER_UL_FIX_ACC']].groupby(by = ['DPTO_MPIO']).mean()
fix_acc_type_q1_group = fix_acc_type_q1_sum.join(fix_acc_type_q1_avg).reset_index()
# Share of each technology and provider in the municipality's total accesses
share_cols = {
    '%_FTTH': 'FIX_TECH_FTTH', '%_FTTX': 'FIX_TECH_FTTX', '%_CABLE': 'FIX_TECH_CABLE',
    '%_FWA': 'FIX_TECH_FWA', '%_COPPER': 'FIX_TECH_COPPER',
    '%_CLARO': 'CLARO', '%_TIGO': 'TIGO', '%_MOVISTAR': 'MOVISTAR', '%_ETB': 'ETB',
    '%_EMCALI': 'EMCALI', '%_DIRECTV': 'DIRECTV', '%_EDATEL': 'EDATEL', '%_HV_MULT': 'HV_MULT',
    '%_TELEBMGA': 'TELEBMGA', '%_METROTEL': 'METROTEL', '%_HUGHES': 'HUGHES', '%_AZTECA': 'AZTECA'}
for pct, src in share_cols.items():
    fix_acc_type_q1_group[pct] = round(fix_acc_type_q1_group[src]/fix_acc_type_q1_group['NBR_TECH_ACCESS'],2)
provider_pcts = ['%_CLARO', '%_TIGO', '%_MOVISTAR', '%_ETB', '%_EMCALI','%_DIRECTV','%_EDATEL','%_HV_MULT','%_TELEBMGA', '%_METROTEL', '%_HUGHES','%_AZTECA']
fix_acc_type_q1_group['%_OTHERS'] = round(1 - (fix_acc_type_q1_group[provider_pcts].sum(axis = 1)),2)
fix_acc_type_q3 = fixed_acc_type_o.loc[((fixed_acc_type_o['AÑO'] == 2020) & (fixed_acc_type_o['TRIMESTRE'] == 2))].drop(['AÑO', 'TRIMESTRE','PROVEEDOR', 'DEPARTAMENTO', 'MUNICIPIO', 'SEGMENTO', 'TECHNOLOGY'], axis= 1)
fix_acc_type_q3_sum = fix_acc_type_q3[['DPTO_MPIO','NBR_TECH_ACCESS', 'CLARO', 'TIGO', 'MOVISTAR', 'ETB', 'EMCALI','DIRECTV','EDATEL','HV_MULT','TELEBMGA', 'METROTEL', 'HUGHES','AZTECA','FIX_TECH_FTTH', 'FIX_TECH_FTTX', 'FIX_TECH_CABLE', 'FIX_TECH_COPPER', 'FIX_TECH_FWA']].groupby(by = ['DPTO_MPIO']).sum()
fix_acc_type_q3_avg = fix_acc_type_q3[['DPTO_MPIO', 'AVG_OFFER_DL_FIX_ACC', 'AVG_OFFER_UL_FIX_ACC']].groupby(by = ['DPTO_MPIO']).mean()
fix_acc_type_q3_group = fix_acc_type_q3_sum.join(fix_acc_type_q3_avg).reset_index()
# Technology share of accesses
for tech in ['FTTH', 'FTTX', 'CABLE', 'FWA', 'COPPER']:
    fix_acc_type_q3_group[f'%_{tech}'] = round(fix_acc_type_q3_group[f'FIX_TECH_{tech}']/fix_acc_type_q3_group['NBR_TECH_ACCESS'],2)
# Provider share of accesses
for p in ['CLARO', 'TIGO', 'MOVISTAR', 'ETB', 'EMCALI', 'DIRECTV', 'EDATEL', 'HV_MULT', 'TELEBMGA', 'METROTEL', 'HUGHES', 'AZTECA']:
    fix_acc_type_q3_group[f'%_{p}'] = round(fix_acc_type_q3_group[p]/fix_acc_type_q3_group['NBR_TECH_ACCESS'],2)
fix_acc_type_q3_group['%_OTHERS'] = round(1 - (fix_acc_type_q3_group[['%_CLARO', '%_TIGO', '%_MOVISTAR', '%_ETB', '%_EMCALI','%_DIRECTV','%_EDATEL','%_HV_MULT','%_TELEBMGA', '%_METROTEL', '%_HUGHES','%_AZTECA']].sum(axis = 1)),2)
print("Fixed access type table Q1 and Q3:")
display(fix_acc_type_q1_group.head(1))
display(fix_acc_type_q3_group.head(1))
#Mobile Coverage Type
mobile_cov_tech_o = mobile_cov_tech[['AÑO','TRIMESTRE','COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE','DPTO_MPIO']].replace({'S': 1, 'N': 0})
onehot_mov_prov = pd.get_dummies(mobile_cov_tech[['PROVEEDOR']], prefix = "", prefix_sep="")
onehot_mov_prov.rename(columns={"AVANTEL S.A.S" : 'AVANTEL', "COLOMBIA MOVIL S.A ESP" : 'TIGO', "COLOMBIA TELECOMUNICACIONES S.A. E.S.P.": "MOVISTAR", "COMUNICACION CELULAR S A COMCEL S A" : "CLARO", "EMPRESA DE TELECOMUNICACIONES DE BOGOTA S.A. ESP": "ETB"}, inplace = True)
mobile_cov_tech_o = mobile_cov_tech_o.join(onehot_mov_prov)
# An operator is counted as present in the municipality if it reports coverage with any technology
# ('COBERTUTA LTE' keeps the typo from the source dataset's column name)
max_cov = mobile_cov_tech_o[['COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE']].max(axis = 1)
for op in ['AVANTEL', 'TIGO', 'MOVISTAR', 'CLARO', 'ETB']:
    mobile_cov_tech_o[f'M_{op}'] = mobile_cov_tech_o[op] * max_cov
mobile_cov_tech_q1 = mobile_cov_tech_o.loc[((mobile_cov_tech_o['AÑO'] == 2020) & (mobile_cov_tech_o['TRIMESTRE'] == 1))]
mobile_cov_tech_q1_group = mobile_cov_tech_q1[['M_AVANTEL','M_TIGO','M_MOVISTAR','M_CLARO', 'M_ETB','COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE','DPTO_MPIO']].groupby(by = ['DPTO_MPIO']).sum()
max_cov_q1 = mobile_cov_tech_q1_group[['COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE']].max(axis = 1)
mobile_cov_tech_q1_group['%_LTE'] = round(mobile_cov_tech_q1_group['COBERTUTA LTE']/max_cov_q1,2)
for op in ['AVANTEL', 'TIGO', 'MOVISTAR', 'CLARO', 'ETB']:
    mobile_cov_tech_q1_group[f'%_M_{op}'] = round(mobile_cov_tech_q1_group[f'M_{op}']/max_cov_q1,2)
mobile_cov_tech_q1_group = mobile_cov_tech_q1_group.reset_index()
mobile_cov_tech_q3 = mobile_cov_tech_o.loc[((mobile_cov_tech_o['AÑO'] == 2020) & (mobile_cov_tech_o['TRIMESTRE'] == 2))]
mobile_cov_tech_q3_group = mobile_cov_tech_q3[['M_AVANTEL','M_TIGO','M_MOVISTAR','M_CLARO', 'M_ETB','COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE','DPTO_MPIO']].groupby(by = ['DPTO_MPIO']).sum()
max_cov_q3 = mobile_cov_tech_q3_group[['COBERTURA 2G','COBERTURA 3G','COBERTURA HSPA+, HSPA+DC','COBERTUTA LTE']].max(axis = 1)
mobile_cov_tech_q3_group['%_LTE'] = round(mobile_cov_tech_q3_group['COBERTUTA LTE']/max_cov_q3,2)
for op in ['AVANTEL', 'TIGO', 'MOVISTAR', 'CLARO', 'ETB']:
    mobile_cov_tech_q3_group[f'%_M_{op}'] = round(mobile_cov_tech_q3_group[f'M_{op}']/max_cov_q3,2)
mobile_cov_tech_q3_group = mobile_cov_tech_q3_group.reset_index()
print("Mobile coverage table Q1 and Q3:")
display(mobile_cov_tech_q1_group.head(1))
display(mobile_cov_tech_q3_group.head(1))
# Copy fixed consolidated
fixed_analysis = fixed_consolidated_geo.copy()
# Join fixed penetration and population
fixed_facc_pen_join_q1 = fix_acc_pen_q1[fix_acc_pen_q1['DPTO_MPIO'].isin(fixed_consolidated_geo['DPTO_MPIO'])]
fixed_analysis = fixed_analysis.merge(fixed_facc_pen_join_q1, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q1"))
fixed_facc_pen_join_q3 = fix_acc_pen_q3[fix_acc_pen_q3['DPTO_MPIO'].isin(fixed_consolidated_geo['DPTO_MPIO'])]
fixed_analysis = fixed_analysis.merge(fixed_facc_pen_join_q3, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q3"))
# Join fixed access type
fixed_facc_type_join_q1 = fix_acc_type_q1_group[fix_acc_type_q1_group['DPTO_MPIO'].isin(fixed_consolidated_geo['DPTO_MPIO'])]
fixed_analysis = fixed_analysis.merge(fixed_facc_type_join_q1, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q1"))
fixed_facc_type_join_q3 = fix_acc_type_q3_group[fix_acc_type_q3_group['DPTO_MPIO'].isin(fixed_consolidated_geo['DPTO_MPIO'])]
fixed_analysis = fixed_analysis.merge(fixed_facc_type_join_q3, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q3"))
fixed_analysis = fixed_analysis.merge(gdp_mun[['DPTO_MPIO','GDP']], how = 'left', on = 'DPTO_MPIO')
fixed_analysis.drop(['DPTO_CCDGO', 'MPIO_CRSLC', 'MPIO_CCNCT','MPIO_NANO', 'SHAPE_AREA','SHAPE_LEN','ORIG_FID'], axis= 1, inplace = True)
print(fixed_consolidated_geo.shape)
print("Fixed consolidated with additional information generated Shape:", fixed_analysis.shape)
display(fixed_analysis.head(2))
#display(fixed_analysis.loc[(fixed_analysis.MPIO_CCDGO == 1)].sort_values(by=['POPULATION'], ascending = False).head(2))
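The Q1/Q3 joins above rely on pandas `merge` with `suffixes = (None, "_Q3")`, so the Q1 copy of an overlapping column keeps its original name and only the Q3 copy is renamed. A minimal sketch with toy values (the municipality code and population figures are illustrative, not the real data):

```python
import pandas as pd

# Toy left (Q1) and right (Q3) frames sharing a column name
left = pd.DataFrame({'DPTO_MPIO': ['05001'], 'POPULATION': [2500000]})
right = pd.DataFrame({'DPTO_MPIO': ['05001'], 'POPULATION': [2530000]})

# suffixes=(None, '_Q3'): left keeps its name, only the right copy is suffixed
merged = left.merge(right, how='left', on='DPTO_MPIO', suffixes=(None, '_Q3'))
```

This is why later KPI calculations can reference the unsuffixed column for Q1 and the `_Q3` column for Q3.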
# Copy mobile consolidated
mobile_analysis = mobile_consolidated_geo.copy()
# Join fixed penetration and population
mobile_facc_pen_join_q1 = fix_acc_pen_q1[fix_acc_pen_q1['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_facc_pen_join_q1, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q1"))
mobile_facc_pen_join_q3 = fix_acc_pen_q3[fix_acc_pen_q3['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_facc_pen_join_q3, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q3"))
# Join fixed access type
mobile_facc_type_join_q1 = fix_acc_type_q1_group[fix_acc_type_q1_group['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_facc_type_join_q1, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q1"))
mobile_facc_type_join_q3 = fix_acc_type_q3_group[fix_acc_type_q3_group['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_facc_type_join_q3, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q3"))
#Join mobile coverage type
mobile_cov_type_join_q1 = mobile_cov_tech_q1_group[mobile_cov_tech_q1_group['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_cov_type_join_q1, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q1"))
mobile_cov_type_join_q3 = mobile_cov_tech_q3_group[mobile_cov_tech_q3_group['DPTO_MPIO'].isin(mobile_consolidated_geo['DPTO_MPIO'])]
mobile_analysis = mobile_analysis.merge(mobile_cov_type_join_q3, how = 'left', on = 'DPTO_MPIO', suffixes = (None, "_Q3"))
mobile_analysis = mobile_analysis.merge(gdp_mun[['DPTO_MPIO','GDP']], how = 'left', on = 'DPTO_MPIO')
mobile_analysis.drop(['DPTO_CCDGO', 'MPIO_CRSLC', 'MPIO_CCNCT','MPIO_NANO', 'SHAPE_AREA','SHAPE_LEN','ORIG_FID'], axis= 1, inplace = True)
print("Mobile consolidated with additional information generated Shape:", mobile_analysis.shape)
display(mobile_analysis.head(2))
# Fixed analysis
fixed_analysis['dif_avg_d_kbps_q3_q1'] = fixed_analysis['avg_d_kbps_q3']-fixed_analysis['avg_d_kbps_q1']
fixed_analysis['%_dif_avg_d_q3_q1'] = fixed_analysis['dif_avg_d_kbps_q3_q1'] / fixed_analysis['avg_d_kbps_q1']
fixed_analysis['dif_avg_u_kbps_q3_q1'] = fixed_analysis['avg_u_kbps_q3']-fixed_analysis['avg_u_kbps_q1']
fixed_analysis['%_dif_avg_u_q3_q1'] = fixed_analysis['dif_avg_u_kbps_q3_q1'] / fixed_analysis['avg_u_kbps_q1']
fixed_analysis['dif_avg_lat_ms_q3_q1'] = fixed_analysis['avg_lat_ms_q3']-fixed_analysis['avg_lat_ms_q1']
fixed_analysis['%_dif_avg_lat_q3_q1'] = fixed_analysis['dif_avg_lat_ms_q3_q1'] / fixed_analysis['avg_lat_ms_q1']
fixed_analysis['%_devices_q3_q1'] = fixed_analysis['devices_q3']/fixed_analysis['devices_q1']
fixed_analysis['POP_DENSITY'] = fixed_analysis['POPULATION_Q3']/fixed_analysis['MPIO_NAREA']
fixed_analysis['GDP_CAPITA'] = fixed_analysis['GDP']/fixed_analysis['POPULATION_Q3']
fixed_analysis['tests_q1_per_100k_pop'] = round(fixed_analysis['tests_q1'] / fixed_analysis['POPULATION'] * 100000,0)
fixed_analysis['tests_q3_per_100k_pop'] = round(fixed_analysis['tests_q3'] / fixed_analysis['POPULATION_Q3'] * 100000,0)
fixed_analysis['dif_q3_q1_fix_access'] = fixed_analysis['NBR_TECH_ACCESS_Q3'] - fixed_analysis['NBR_TECH_ACCESS']
fixed_analysis['dif_q3_q1_FTTH'] = fixed_analysis['FIX_TECH_FTTH_Q3'] - fixed_analysis['FIX_TECH_FTTH']
fixed_analysis['dif_q3_q1_FTTX'] = fixed_analysis['FIX_TECH_FTTX_Q3'] - fixed_analysis['FIX_TECH_FTTX']
fixed_analysis['dif_q3_q1_CABLE'] = fixed_analysis['FIX_TECH_CABLE_Q3'] - fixed_analysis['FIX_TECH_CABLE']
fixed_analysis['dif_q3_q1_COPPER'] = fixed_analysis['FIX_TECH_COPPER_Q3'] - fixed_analysis['FIX_TECH_COPPER']
fixed_analysis['dif_q3_q1_FWA'] = fixed_analysis['FIX_TECH_FWA_Q3'] - fixed_analysis['FIX_TECH_FWA']
print("Fixed calculated with difference in KPIs Shape:", fixed_analysis.shape)
display(fixed_analysis.head(2))
# Mobile analysis
mobile_analysis['dif_avg_d_kbps_q3_q1'] = mobile_analysis['avg_d_kbps_q3']-mobile_analysis['avg_d_kbps_q1']
mobile_analysis['%_dif_avg_d_q3_q1'] = mobile_analysis['dif_avg_d_kbps_q3_q1'] / mobile_analysis['avg_d_kbps_q1']
mobile_analysis['dif_avg_u_kbps_q3_q1'] = mobile_analysis['avg_u_kbps_q3']-mobile_analysis['avg_u_kbps_q1']
mobile_analysis['%_dif_avg_u_q3_q1'] = mobile_analysis['dif_avg_u_kbps_q3_q1'] / mobile_analysis['avg_u_kbps_q1']
mobile_analysis['dif_avg_lat_ms_q3_q1'] = mobile_analysis['avg_lat_ms_q3']-mobile_analysis['avg_lat_ms_q1']
mobile_analysis['%_dif_avg_lat_q3_q1'] = mobile_analysis['dif_avg_lat_ms_q3_q1'] / mobile_analysis['avg_lat_ms_q1']
mobile_analysis['%_devices_q3_q1'] = mobile_analysis['devices_q3']/mobile_analysis['devices_q1']
mobile_analysis['POP_DENSITY'] = mobile_analysis['POPULATION_Q3']/mobile_analysis['MPIO_NAREA']
mobile_analysis['GDP_CAPITA'] = mobile_analysis['GDP']/mobile_analysis['POPULATION_Q3']
mobile_analysis['tests_q1_per_100k_pop'] = round(mobile_analysis['tests_q1'] / mobile_analysis['POPULATION'] * 100000,0)
mobile_analysis['tests_q3_per_100k_pop'] = round(mobile_analysis['tests_q3'] / mobile_analysis['POPULATION_Q3'] * 100000,0)
mobile_analysis['dif_q3_q1_COBERTURA 2G'] = mobile_analysis['COBERTURA 2G_Q3'] - mobile_analysis['COBERTURA 2G']
mobile_analysis['dif_q3_q1_COBERTURA 3G'] = mobile_analysis['COBERTURA 3G_Q3'] - mobile_analysis['COBERTURA 3G']
mobile_analysis['dif_q3_q1_COBERTURA LTE'] = mobile_analysis['COBERTUTA LTE_Q3'] - mobile_analysis['COBERTUTA LTE']
mobile_analysis['dif_q3_q1_%_LTE'] = mobile_analysis['%_LTE_Q3'] - mobile_analysis['%_LTE']
print("Mobile calculated with difference in KPIs Shape:", mobile_analysis.shape)
display(mobile_analysis.head(2))
fixed_analysis_no_geo = fixed_analysis.drop(['geometry'], axis = 1)
fixed_analysis_no_geo = fixed_analysis_no_geo.drop_duplicates()
print("Fixed without geometry created Shape:", fixed_analysis_no_geo.shape)
print(fixed_facc_pen_join_q1.shape)
mobile_analysis_no_geo = mobile_analysis.drop(['geometry'], axis = 1)
mobile_analysis_no_geo = mobile_analysis_no_geo.drop_duplicates()
print("Mobile without geometry created Shape:", mobile_analysis_no_geo.shape)
print(mobile_facc_pen_join_q1.shape)
fixed_analysis_no_geo.to_csv("gs://imperial_analytics_project/backup/fixed_analysis_no_geo.csv")
mobile_analysis_no_geo.to_csv("gs://imperial_analytics_project/backup/mobile_analysis_no_geo.csv")
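One caveat when persisting these tables: `to_csv` writes the index by default, so reading the backups back later adds an `Unnamed: 0` column. A local round-trip sketch (using `StringIO` in place of the `gs://` paths from the notebook) illustrates the behavior:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'DPTO_MPIO': ['05001', '08001'], 'GDP': [1.2, 0.8]})

buf = StringIO()
df.to_csv(buf)                 # default: the index is written as an extra column
buf.seek(0)
reloaded = pd.read_csv(buf)    # comes back with a stray 'Unnamed: 0' column

buf2 = StringIO()
df.to_csv(buf2, index=False)   # index=False keeps only the data columns
buf2.seek(0)
clean = pd.read_csv(buf2)
```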
agg_traf_day = agg_traf_op.copy()
agg_traf_base_day = agg_traf_base.copy()
agg_traf_day.rename(columns={" Fecha del día de tráfico" : 'DATE', "Tráfico Datos Total Día (GB)" : 'DAILY_TRAFFIC'}, inplace = True)
agg_traf_day.drop(['NIT Proveedor', 'Hora Pico', 'Tráfico Datos Internacional (GB)', 'Tráfico Datos NAPs - Colombia (GB)', 'Tráfico Datos Acuerdos de tránsito o peering directo (GB)', 'Tráfico Datos Local (GB)'] , axis = 1, inplace = True)
agg_traf_day['DATE'] = pd.to_datetime(agg_traf_day['DATE'], format = '%d/%m/%Y')
agg_traf_day.dtypes
agg_traf_day_group = agg_traf_day[['DATE','DAILY_TRAFFIC']].groupby(by = ['DATE']).sum().reset_index()
agg_traf_day_group['MONTH'] = pd.DatetimeIndex(agg_traf_day_group['DATE']).month
agg_traf_day_group['MONTHLY_AVG'] = agg_traf_day_group['DAILY_TRAFFIC']
agg_traf_day['TYPE'] = "daily traffic"
# Replace daily values in months March-November with that month's mean traffic
for i in range(3, 12):
    agg_traf_day_group['MONTHLY_AVG'] = np.where((agg_traf_day_group['MONTH'] == i), agg_traf_day_group.loc[agg_traf_day_group['MONTH'] == i]['DAILY_TRAFFIC'].mean(), agg_traf_day_group['MONTHLY_AVG'])
agg_traf_day_group = agg_traf_day_group.drop(['MONTH'], axis = 1)
agg_traf_day_group['TYPE'] = "daily traffic"
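The month-by-month `np.where` loop above can also be expressed with a `groupby` transform. A sketch on toy data (values illustrative) that reproduces the same behavior, replacing only months 3 through 11 with their monthly mean:

```python
import pandas as pd

# Toy daily series: two January days and two March days
df = pd.DataFrame({
    'MONTH': [1, 1, 3, 3],
    'DAILY_TRAFFIC': [50.0, 54.0, 70.0, 90.0],
})

# Monthly mean per row, then keep daily values outside March-November
monthly_mean = df.groupby('MONTH')['DAILY_TRAFFIC'].transform('mean')
df['MONTHLY_AVG'] = df['DAILY_TRAFFIC'].where(~df['MONTH'].between(3, 11), monthly_mean)
```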
agg_traf_base_day['AVG_BASE_TRAFFIC'] = agg_traf_base_day['AVG_BASE_TRAFFIC'].str.replace(',','').astype('float')
agg_traf_base_day['DATE'] = pd.to_datetime(agg_traf_base_day['DATE'], format = '%d/%m/%Y')
agg_traf_base_day = agg_traf_base_day.drop(['NIT Proveedor'], axis = 1).rename(columns={"AVG_BASE_TRAFFIC" : 'DAILY_TRAFFIC'})
agg_traf_base_day['TYPE'] = "monthly average base"
agg_traf_base_group = agg_traf_base_day[['DATE','DAILY_TRAFFIC']].groupby(by = ['DATE']).sum().reset_index()
agg_traf_base_group['TYPE'] = "monthly average base"
#Concat DataFrames
agg_traf_total_concat = pd.concat([agg_traf_day, agg_traf_base_day])
agg_traf_group_concat = pd.concat([agg_traf_day_group, agg_traf_base_group])
display(agg_traf_group_concat.tail(2))
print(agg_traf_group_concat.shape)
display(agg_traf_total_concat.head(2))
print(agg_traf_total_concat.shape)
The data preparation process resulted in three datasets that will be used for the analysis.
fig_dims = (20, 14)
fig, ax = plt.subplots(2,1, figsize=fig_dims)
ax[0].set(ylim=(3e7, 10e7))
sns.lineplot(data = agg_traf_group_concat, x = 'DATE', y = 'DAILY_TRAFFIC', style = 'TYPE', ax=ax[0], legend = "full")
sns.lineplot(data = agg_traf_day_group, x = 'DATE', y = 'MONTHLY_AVG', ax=ax[0], legend = "full", label = "traffic monthly average", color = "red")
g = sns.lineplot(data = agg_traf_total_concat, x = 'DATE', y = 'DAILY_TRAFFIC', style = "TYPE", hue = 'Proveedor', ax=ax[1], legend = "brief")
g.legend(ncol = 5, bbox_to_anchor=(0.9, -0.08))
Remarks: The graphs show a clear increase in traffic starting in March, jumping from a daily average of 52 to 57 PB in January and February to 72 to 90 PB between March and December. Colombia imposed its lockdown on March 22nd and began relaxing restrictions in May; the general quarantine ended in August. The charts show that traffic did not decrease after the economy re-opened. On the contrary, it appears to keep growing. This might be explained by schools remaining closed and many companies continuing to work remotely, but it may also indicate that society has undergone a digital transformation that will outlast the pandemic emergency. In that case, network infrastructure will have to adapt to this new reality.
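The pre/post lockdown averages quoted in the remarks can be computed directly from the daily series. A sketch on illustrative values (not the real traffic data):

```python
import pandas as pd

# Illustrative daily-traffic values in PB
traffic = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-01-15', '2020-02-15', '2020-04-15', '2020-06-15']),
    'DAILY_TRAFFIC': [52.0, 57.0, 78.0, 90.0],
})

lockdown_start = pd.Timestamp('2020-03-22')
pre = traffic.loc[traffic['DATE'] < lockdown_start, 'DAILY_TRAFFIC'].mean()
post = traffic.loc[traffic['DATE'] >= lockdown_start, 'DAILY_TRAFFIC'].mean()
```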
fig_dims = (20, 14)
fig, ax1 = plt.subplots(2,1, figsize=fig_dims)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
fix_un_pivot_dif_acc = fixed_analysis_no_geo[['dif_q3_q1_FTTH','dif_q3_q1_FTTX', 'dif_q3_q1_CABLE', 'dif_q3_q1_COPPER', 'dif_q3_q1_FWA']].melt(var_name='TECHNOLOGY' , value_name = 'DIFFERENCE_Q3_Q1')
mob_un_pivot_dif_tech = mobile_analysis_no_geo[['dif_q3_q1_COBERTURA LTE', 'dif_q3_q1_COBERTURA 3G', 'dif_q3_q1_COBERTURA 2G']].melt(var_name='TECHNOLOGY' , value_name = 'DIFFERENCE_Q3_Q1')
sns.barplot(data = fix_un_pivot_dif_acc, x = "TECHNOLOGY", y= "DIFFERENCE_Q3_Q1", ax = ax1[0], capsize = 0.3, estimator = sum, ci = 99)
sns.barplot(data = mob_un_pivot_dif_tech, x = "TECHNOLOGY", y= "DIFFERENCE_Q3_Q1", ax = ax1[1], capsize = 0.3, estimator = sum, ci = 99)
fig_dims = (20, 20)
fig, ax = plt.subplots(2,2, figsize=fig_dims)
df_plot1 = fixed_consolidated_geo.copy()
df_plot2 = mobile_consolidated_geo.copy()
variable1 = 'avg_d_kbps_q1'
variable2 = 'avg_d_kbps_q3'
scheme_1 = mpclas.Quantiles(df_plot1[variable1], k=4)
scheme_2 = mpclas.Quantiles(df_plot2[variable1], k=4)
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable1], scheme=scheme_1, legend = True, ax = ax[0,0])
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable2], scheme=scheme_1, legend = True, ax = ax[0,1])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable1], scheme=scheme_2, legend = True, ax = ax[1,0])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable2], scheme=scheme_2, legend = True, ax = ax[1,1])
fig.suptitle("Average DL Speed (kbps) per Type of Connectivity and Quarter", fontsize=16)
ax[0,0].set_title("Average Fixed DL Speed Q1 (kbps)")
ax[0,1].set_title("Average Fixed DL Speed Q3 (kbps)")
ax[1,0].set_title("Average Mobile DL Speed Q1 (kbps)")
ax[1,1].set_title("Average Mobile DL Speed Q3 (kbps)")
fig.tight_layout()
fig.subplots_adjust(top=0.94)
fig_dims = (20, 20)
fig, ax = plt.subplots(2,2, figsize=fig_dims)
df_plot1 = fixed_consolidated_geo.copy()
df_plot2 = mobile_consolidated_geo.copy()
variable1 = 'avg_u_kbps_q1'
variable2 = 'avg_u_kbps_q3'
scheme_1 = mpclas.Quantiles(df_plot1[variable1], k=4)
scheme_2 = mpclas.Quantiles(df_plot2[variable1], k=4)
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable1], scheme=scheme_1, legend = True, ax = ax[0,0])
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable2], scheme=scheme_1, legend = True, ax = ax[0,1])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable1], scheme=scheme_2, legend = True, ax = ax[1,0])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable2], scheme=scheme_2, legend = True, ax = ax[1,1])
fig.suptitle("Average UL Speed (kbps) per Type of Connectivity and Quarter", fontsize=16)
ax[0,0].set_title("Average Fixed UL Speed Q1 (kbps)")
ax[0,1].set_title("Average Fixed UL Speed Q3 (kbps)")
ax[1,0].set_title("Average Mobile UL Speed Q1 (kbps)")
ax[1,1].set_title("Average Mobile UL Speed Q3 (kbps)")
fig.tight_layout()
fig.subplots_adjust(top=0.94)
fig_dims = (20, 20)
fig, ax = plt.subplots(2,2, figsize=fig_dims)
df_plot1 = fixed_consolidated_geo.copy()
df_plot2 = mobile_consolidated_geo.copy()
variable1 = 'avg_lat_ms_q1'
variable2 = 'avg_lat_ms_q3'
scheme_1 = mpclas.Quantiles(df_plot1[variable1], k=4)
scheme_2 = mpclas.Quantiles(df_plot2[variable1], k=4)
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable1], scheme=scheme_1, legend = True, ax = ax[0,0])
gpl.choropleth(df_plot1, cmap="Blues", hue=df_plot1[variable2], scheme=scheme_1, legend = True, ax = ax[0,1])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable1], scheme=scheme_2, legend = True, ax = ax[1,0])
gpl.choropleth(df_plot2, cmap="Blues", hue=df_plot2[variable2], scheme=scheme_2, legend = True, ax = ax[1,1])
fig.suptitle("Average Latency (ms) per Type of Connectivity and Quarter", fontsize=16)
ax[0,0].set_title("Average Fixed Latency Q1 (ms)")
ax[0,1].set_title("Average Fixed Latency Q3 (ms)")
ax[1,0].set_title("Average Mobile Latency Q1 (ms)")
ax[1,1].set_title("Average Mobile Latency Q3 (ms)")
fig.tight_layout()
fig.subplots_adjust(top=0.94)
# Understand kpi distribution using describe function
display(fixed_analysis_no_geo[['devices_q1', 'devices_q2', 'devices_q3', 'avg_d_kbps_q1', 'avg_d_kbps_q2','avg_d_kbps_q3', 'avg_u_kbps_q1', 'avg_u_kbps_q2' , 'avg_u_kbps_q3', 'avg_lat_ms_q1', 'avg_lat_ms_q2', 'avg_lat_ms_q3']].describe())
display(mobile_analysis_no_geo[['devices_q1', 'devices_q2', 'devices_q3', 'avg_d_kbps_q1', 'avg_d_kbps_q2','avg_d_kbps_q3', 'avg_u_kbps_q1', 'avg_u_kbps_q2' , 'avg_u_kbps_q3', 'avg_lat_ms_q1', 'avg_lat_ms_q2', 'avg_lat_ms_q3']].describe())
fix_un_pivot_kpi = fixed_analysis_no_geo[['avg_d_kbps_q1','avg_d_kbps_q2','avg_d_kbps_q3']].melt(var_name='Quarter_DL' , value_name = 'Avg_DL_kbps')
fix_un_pivot_kpi = fix_un_pivot_kpi.join(fixed_analysis_no_geo[['avg_u_kbps_q1','avg_u_kbps_q2','avg_u_kbps_q3']].melt(var_name='Quarter_UL' , value_name = 'Avg_UL_kbps'))
fix_un_pivot_kpi = fix_un_pivot_kpi.join(fixed_analysis_no_geo[['avg_lat_ms_q1','avg_lat_ms_q2','avg_lat_ms_q3']].melt(var_name='Quarter_Lat' , value_name = 'Avg_Lat_ms'))
fix_un_pivot_kpi = fix_un_pivot_kpi.join(fixed_analysis_no_geo[['devices_q1','devices_q2','devices_q3']].melt(var_name='Quarter_dev' , value_name = 'Number_unique_devices'))
fix_un_pivot_kpi['Quarter'] = fix_un_pivot_kpi['Quarter_DL'].str[-2:]
fix_un_pivot_kpi = fix_un_pivot_kpi.drop(['Quarter_DL', 'Quarter_UL', 'Quarter_Lat', 'Quarter_dev'], axis = 1)
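The melt-plus-suffix-slicing pattern above turns the per-quarter wide columns into a long table keyed by quarter; a minimal sketch on a toy frame:

```python
import pandas as pd

wide = pd.DataFrame({'avg_d_kbps_q1': [10000], 'avg_d_kbps_q3': [12000]})
long = wide.melt(var_name='Quarter_DL', value_name='Avg_DL_kbps')
# The quarter label is the last two characters of the original column name
long['Quarter'] = long['Quarter_DL'].str[-2:]
```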
fig, ax1 = plt.subplots(1,2, figsize=(23,6))
sns.lineplot(data=fix_un_pivot_kpi, x="Quarter", y="Avg_DL_kbps", label= "Avg_Downlink_Speed", ax = ax1[0], err_style = "bars" )
sns.lineplot(data=fix_un_pivot_kpi, x="Quarter", y="Avg_UL_kbps", label= "Avg_Uplink_Speed", ax = ax1[0], err_style = "bars")
sns.lineplot(data=fix_un_pivot_kpi, x="Quarter", y="Avg_Lat_ms", ax = ax1[1], err_style = "bars")
ax1[0].legend(loc='lower right', fontsize = 12)
ax1[0].set_title("Fixed Average DL/UL Speed KPIs Evolution per Quarter")
ax1[0].set_ylabel('Average Speed (kbps)')
ax1[1].set_title("Fixed Average Latency KPI Evolution per Quarter")
mob_un_pivot_kpi = mobile_analysis_no_geo[['avg_d_kbps_q1','avg_d_kbps_q2','avg_d_kbps_q3']].melt(var_name='Quarter_DL' , value_name = 'Avg_DL_kbps')
mob_un_pivot_kpi = mob_un_pivot_kpi.join(mobile_analysis_no_geo[['avg_u_kbps_q1','avg_u_kbps_q2','avg_u_kbps_q3']].melt(var_name='Quarter_UL' , value_name = 'Avg_UL_kbps'))
mob_un_pivot_kpi = mob_un_pivot_kpi.join(mobile_analysis_no_geo[['avg_lat_ms_q1','avg_lat_ms_q2','avg_lat_ms_q3']].melt(var_name='Quarter_Lat' , value_name = 'Avg_Lat_ms'))
mob_un_pivot_kpi = mob_un_pivot_kpi.join(mobile_analysis_no_geo[['devices_q1','devices_q2','devices_q3']].melt(var_name='Quarter_dev' , value_name = 'Number_unique_devices'))
mob_un_pivot_kpi['Quarter'] = mob_un_pivot_kpi['Quarter_DL'].str[-2:]
mob_un_pivot_kpi = mob_un_pivot_kpi.drop(['Quarter_DL', 'Quarter_UL', 'Quarter_Lat', 'Quarter_dev'], axis = 1)
fig, ax1 = plt.subplots(1,2, figsize=(23,6))
sns.lineplot(data=mob_un_pivot_kpi, x="Quarter", y="Avg_DL_kbps", label= "Avg_Downlink_Speed", ax = ax1[0], err_style = "bars" )
sns.lineplot(data=mob_un_pivot_kpi, x="Quarter", y="Avg_UL_kbps", label= "Avg_Uplink_Speed", ax = ax1[1], err_style = "bars")
sns.lineplot(data=mob_un_pivot_kpi, x="Quarter", y="Avg_Lat_ms", ax = ax1[1], err_style = "bars")
ax1[0].legend(loc='center right', fontsize = 12)
ax1[0].set_title("Mobile Average DL/UL Speed KPIs Evolution per Quarter")
ax1[0].set_ylabel('Average Speed (kbps)')
ax1[1].set_title("Mobile Average Latency KPI Evolution per Quarter")
# Plot Average DL Distribution
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'avg_d_kbps_q1'
variable2 = 'avg_d_kbps_q3'
offset = 0
bins = np.linspace(0, 50000, 100)
xx = np.linspace(0, 50000, 1000)
fxq1 = fixed_analysis_no_geo[variable1]
fkdeq1 = stats.gaussian_kde(fxq1)
fxq3 = fixed_analysis_no_geo[variable2]
fkdeq3 = stats.gaussian_kde(fxq3)
mxq1 = mobile_analysis_no_geo[variable1]
mkdeq1 = stats.gaussian_kde(mxq1)
mxq3 = mobile_analysis_no_geo[variable2]
mkdeq3 = stats.gaussian_kde(mxq3)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fxq1, bins= bins - offset, alpha=0.2, label='Average DL Speed Q1')
ax1[0].hist(fxq3, bins= bins- offset, alpha=0.2, label='Average DL Speed Q3')
ax2[0].plot(xx- offset, fkdeq1(xx), label = "Density_Function_Q1")
ax2[0].plot(xx - offset, fkdeq3(xx), label = "Density_Function_Q3")
ax1[0].set_xlabel('Average DL Speed (kbps)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
ax2[0].legend(bbox_to_anchor=(1, 0.85))
ax2[0].set_ylim(0,0.000115)
ax1[0].set_title("Average Fixed DL Speed Distribution")
ax1[1].hist(mxq1, bins= bins - offset, alpha=0.2, label='Average DL Speed Q1')
ax1[1].hist(mxq3, bins= bins- offset, alpha=0.2, label='Average DL Speed Q3')
ax2[1].plot(xx - offset, mkdeq1(xx), label = "Density_Function_Q1")
ax2[1].plot(xx - offset, mkdeq3(xx), label = "Density_Function_Q3")
ax1[1].set_xlabel('Average DL Speed (kbps)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
ax2[1].legend(bbox_to_anchor=(1, 0.85))
ax2[1].set_ylim(0,4.3e-5)
ax1[1].set_title("Average Mobile DL Speed Distribution")
fig.suptitle("Average DL Speed Distribution per Type of Connectivity and Quarter", fontsize=16)
plt.show()
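The histogram/KDE overlays above rely on scipy's `gaussian_kde`; a self-contained sketch on synthetic data (the loc/scale parameters are illustrative, not the real KPI values) showing the density evaluated on a grid:

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for a per-municipality KPI
rng = np.random.default_rng(0)
sample = rng.normal(loc=10000, scale=2000, size=500)

kde = stats.gaussian_kde(sample)      # bandwidth chosen by Scott's rule by default
xx = np.linspace(0, 20000, 1000)
density = kde(xx)

# The estimated density should integrate to roughly 1 over a wide enough grid
area = density.sum() * (xx[1] - xx[0])
```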
# Plot Average UL Distribution
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'avg_u_kbps_q1'
variable2 = 'avg_u_kbps_q3'
offset = 0
bins = np.linspace(0, 40000, 100)
xx = np.linspace(0, 40000, 1000)
fxq1 = fixed_analysis_no_geo[variable1]
fkdeq1 = stats.gaussian_kde(fxq1)
fxq3 = fixed_analysis_no_geo[variable2]
fkdeq3 = stats.gaussian_kde(fxq3)
mxq1 = mobile_analysis_no_geo[variable1]
mkdeq1 = stats.gaussian_kde(mxq1)
mxq3 = mobile_analysis_no_geo[variable2]
mkdeq3 = stats.gaussian_kde(mxq3)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fxq1, bins= bins - offset, alpha=0.2, label='Average UL Speed Q1')
ax1[0].hist(fxq3, bins= bins- offset, alpha=0.2, label='Average UL Speed Q3')
ax2[0].plot(xx- offset, fkdeq1(xx), label = "Density_Function_Q1")
ax2[0].plot(xx - offset, fkdeq3(xx), label = "Density_Function_Q3")
ax1[0].set_xlabel('Average UL Speed (kbps)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
ax2[0].legend(bbox_to_anchor=(1, 0.85))
ax2[0].set_ylim(0, 1.55e-4)
ax1[0].set_title("Average Fixed UL Speed Distribution")
ax1[1].hist(mxq1, bins= bins - offset, alpha=0.2, label='Average UL Speed Q1')
ax1[1].hist(mxq3, bins= bins- offset, alpha=0.2, label='Average UL Speed Q3')
ax2[1].plot(xx- offset, mkdeq1(xx), label = "Density_Function_Q1")
ax2[1].plot(xx - offset, mkdeq3(xx), label = "Density_Function_Q3")
ax1[1].set_xlabel('Average UL Speed (kbps)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
ax2[1].legend(bbox_to_anchor=(1, 0.85))
ax2[1].set_ylim(0,0.9e-4)
ax1[1].set_title("Average Mobile UL Speed Distribution")
fig.suptitle("Average UL Speed Distribution per Type of Connectivity and Quarter", fontsize=16)
plt.show()
# Plot Average Latency Distribution
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'avg_lat_ms_q1'
variable2 = 'avg_lat_ms_q3'
offset = 0
bins = np.linspace(0, 400, 100)
xx = np.linspace(0, 400, 1000)
fxq1 = fixed_analysis_no_geo[variable1]
fkdeq1 = stats.gaussian_kde(fxq1)
fxq3 = fixed_analysis_no_geo[variable2]
fkdeq3 = stats.gaussian_kde(fxq3)
mxq1 = mobile_analysis_no_geo[variable1]
mkdeq1 = stats.gaussian_kde(mxq1)
mxq3 = mobile_analysis_no_geo[variable2]
mkdeq3 = stats.gaussian_kde(mxq3)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fxq1, bins= bins - offset, alpha=0.2, label='Average Latency Q1')
ax1[0].hist(fxq3, bins= bins- offset, alpha=0.2, label='Average Latency Q3')
ax2[0].plot(xx- offset, fkdeq1(xx), label = "Density_Function_Q1")
ax2[0].plot(xx - offset, fkdeq3(xx), label = "Density_Function_Q3")
ax1[0].set_xlabel('Average Latency (ms)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
ax2[0].legend(bbox_to_anchor=(1, 0.85))
ax2[0].set_ylim(0, 0.0095)
ax1[0].set_title("Average Fixed Latency Distribution")
ax1[1].hist(mxq1, bins= bins - offset, alpha=0.2, label='Average Latency Q1')
ax1[1].hist(mxq3, bins= bins- offset, alpha=0.2, label='Average Latency Q3')
ax2[1].plot(xx- offset, mkdeq1(xx), label = "Density_Function_Q1")
ax2[1].plot(xx - offset, mkdeq3(xx), label = "Density_Function_Q3")
ax1[1].set_xlabel('Average Latency (ms)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
ax2[1].legend(bbox_to_anchor=(1, 0.85))
ax2[1].set_ylim(0,0.029)
ax1[1].set_title("Average Mobile Latency Distribution")
fig.suptitle("Average Latency Distribution per Type of Connectivity and Quarter", fontsize=16)
plt.show()
# Plot Average DL Distribution difference
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'dif_avg_d_kbps_q3_q1'
offset = 1
bins = np.linspace(-20000, 20000, 200)
xx = np.linspace(-20000, 20000, 1000)
fx = fixed_analysis_no_geo[variable1]
fkde = stats.gaussian_kde(fx)
mx = mobile_analysis_no_geo[variable1]
mkde = stats.gaussian_kde(mx)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fx, bins= bins - offset, alpha=0.2, label='Average DL Speed Difference (Q3-Q1)')
ax2[0].plot(xx- offset, fkde(xx), label = "Density_Function")
ax1[0].set_xlabel('Average DL Speed (kbps)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
#ax2[0].legend(bbox_to_anchor=(1, 0.85))
#ax2[0].set_ylim(0, 0.007)
ax1[0].set_title("Average Fixed DL Speed Difference (Q3-Q1) Distribution")
ax1[1].hist(mx, bins= bins - offset, alpha=0.2, label='Average DL Speed Difference (Q3-Q1)')
ax2[1].plot(xx- offset, mkde(xx), label = "Density_Function")
ax1[1].set_xlabel('Average DL Speed (kbps)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
#ax2[1].legend(bbox_to_anchor=(1, 0.85))
#ax2[1].set_ylim(0,0.007)
ax1[1].set_title("Average Mobile DL Speed Difference (Q3-Q1) Distribution")
fig.suptitle("Average DL Speed Difference (Q3-Q1) Distribution per Type of Connectivity", fontsize=16)
plt.show()
# Plot Average UL Distribution difference
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'dif_avg_u_kbps_q3_q1'
offset = 1
bins = np.linspace(-20000, 20000, 200)
xx = np.linspace(-20000, 20000, 1000)
fx = fixed_analysis_no_geo[variable1]
fkde = stats.gaussian_kde(fx)
mx = mobile_analysis_no_geo[variable1]
mkde = stats.gaussian_kde(mx)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fx, bins= bins - offset, alpha=0.2, label='Average UL Speed Difference (Q3-Q1)')
ax2[0].plot(xx- offset, fkde(xx), label = "Density_Function")
ax1[0].set_xlabel('Average UL Speed (kbps)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
#ax2[0].legend(bbox_to_anchor=(1, 0.85))
#ax2[0].set_ylim(0, 0.007)
ax1[0].set_title("Average Fixed UL Speed Difference (Q3-Q1) Distribution")
ax1[1].hist(mx, bins= bins - offset, alpha=0.2, label='Average UL Speed Difference (Q3-Q1)')
ax2[1].plot(xx- offset, mkde(xx), label = "Density_Function")
ax1[1].set_xlabel('Average UL Speed (kbps)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
#ax2[1].legend(bbox_to_anchor=(1, 0.85))
#ax2[1].set_ylim(0,0.007)
ax1[1].set_title("Average Mobile UL Speed Difference (Q3-Q1) Distribution")
fig.suptitle("Average UL Speed Difference (Q3-Q1) Distribution per Type of Connectivity", fontsize=16)
plt.show()
# Plot Average Latency Distribution difference
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
variable1 = 'dif_avg_lat_ms_q3_q1'
offset = 1
bins = np.linspace(-100, 100, 200)
xx = np.linspace(-100, 100, 1000)
fx = fixed_analysis_no_geo[variable1]
fkde = stats.gaussian_kde(fx)
mx = mobile_analysis_no_geo[variable1]
mkde = stats.gaussian_kde(mx)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fx, bins= bins - offset, alpha=0.2, label='Average Latency Difference (Q3-Q1)')
ax2[0].plot(xx- offset, fkde(xx), label = "Density_Function")
ax1[0].set_xlabel('Average Latency (ms)')
ax1[0].set_ylabel('Amount of Municipalities')
ax2[0].set_ylabel('KDE Distribution')
ax1[0].legend(loc='upper right')
#ax2[0].legend(bbox_to_anchor=(1, 0.85))
#ax2[0].set_ylim(0, 0.007)
ax1[0].set_title("Average Fixed Latency Difference (Q3-Q1) Distribution")
ax1[1].hist(mx, bins= bins - offset, alpha=0.2, label='Average Latency Difference (Q3-Q1)')
ax2[1].plot(xx- offset, mkde(xx), label = "Density_Function")
ax1[1].set_xlabel('Average Latency (ms)')
ax1[1].set_ylabel('Amount of Municipalities')
ax2[1].set_ylabel('KDE Distribution')
ax1[1].legend(loc='upper right')
#ax2[1].legend(bbox_to_anchor=(1, 0.85))
#ax2[1].set_ylim(0,0.007)
ax1[1].set_title("Average Mobile Latency Difference (Q3-Q1) Distribution")
fig.suptitle("Average Latency Difference (Q3-Q1) Distribution per Type of Connectivity", fontsize=16)
plt.show()
The distribution plots did not show an evident deterioration in the mean or the overall distribution when considering all municipalities. However, the distributions of the KPI differences between Q3 and Q1 do show a severe deterioration for a certain group of municipalities, especially for the uplink speed and latency KPIs. This section plots the distribution of the percentage deterioration for each KPI.
Although the datasets include information for 967 municipalities for fixed networks and 948 for mobile networks, many municipalities contributed only a handful of tests and one or two unique devices. A minimum number of tests per municipality must therefore be defined before any further analysis. The average population of a Colombian municipality is only 43,816 inhabitants. Sampling design theory dictates that a random sample of around 740 would be needed to reach a 90% confidence level with a 3% margin of error. This ideal sample size is far beyond the volumes in the Ookla dataset, which in addition is not randomized at all. I therefore decided to define a minimum threshold based on the distribution of tests per 100,000 inhabitants in the dataset.
The graph shows that 122 municipalities had fewer than 50 tests per 100,000 inhabitants in Q1 or Q3 for fixed networks, and 340 municipalities for mobile networks. Although 50 tests per 100,000 inhabitants is far from an ideal sample size, increasing the threshold would exclude more than 50% of the base of municipalities.
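The sample-size figure quoted above can be reproduced with the standard formula for estimating a proportion, applying the finite population correction (a minimal sketch; the 43,816 figure is the average municipal population mentioned in the text, and `sample_size` is a helper defined here, not part of the project code):

```python
import math

def sample_size(population, z=1.645, margin=0.03, p=0.5):
    """Required simple-random-sample size for a proportion estimate
    at the given confidence level (z) and margin of error."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite population correction

# Average Colombian municipality, 90% confidence (z = 1.645), 3% error
print(sample_size(43816))  # → 740
```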
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
bins = np.linspace(0, 1000, 1000)
fx = fixed_analysis_no_geo[['tests_q1_per_100k_pop','tests_q3_per_100k_pop']].min(axis = 1)
mx = mobile_analysis_no_geo[['tests_q1_per_100k_pop','tests_q3_per_100k_pop']].min(axis = 1)
ax2 = np.array([a.twinx() for a in ax1.ravel()])
ax1[0].hist(fx, bins= bins, alpha=0.5, label='Min of Q1 and Q3 for amount of tests per 100k inhab.')
ax2[0].hist(fx, bins= bins, histtype='step', cumulative=1)
ax1[0].set_xlabel('Amount of tests per 100k inhabitants')
ax2[0].set_ylabel('Amount of Municipalities - Cumulative Distribution')
ax1[0].set_ylabel('Amount of Municipalities - Histogram')
ax1[0].legend(loc='upper left')
ax1[0].set_title("Fixed Min of Q1 and Q3 for amount of tests per 100k inhab.", fontsize = 14)
ax1[0].set_xticks(np.arange(0, 1000, step=50))
ax1[0].grid( linestyle='--')
ax1[1].hist(mx, bins= bins, alpha=0.5, label='Min of Q1 and Q3 for amount of tests per 100k inhab.')
ax2[1].hist(mx, bins= bins, histtype='step', cumulative=1)
ax1[1].set_xlabel('Amount of tests per 100k inhabitants')
ax2[1].set_ylabel('Amount of Municipalities - Cumulative Distribution')
ax1[1].set_ylabel('Amount of Municipalities - Histogram')
ax1[1].legend(loc='upper left')
ax1[1].set_title("Mobile Min of Q1 and Q3 for amount of tests per 100k inhab.", fontsize = 14)
ax1[1].set_xticks(np.arange(0, 1000, step=50))
ax1[1].grid( linestyle='--')
# Discard municipalities with less than 50 tests per 100k inhabitants in either Q1 or Q3
thr_test_fix = 50
thr_test_mob = 50
# .copy() so that columns added later do not raise SettingWithCopyWarning
fixed_analysis_no_geo_thr = fixed_analysis_no_geo.loc[(fixed_analysis_no_geo['tests_q1_per_100k_pop'] > thr_test_fix) & (fixed_analysis_no_geo['tests_q3_per_100k_pop'] > thr_test_fix)].copy()
mobile_analysis_no_geo_thr = mobile_analysis_no_geo.loc[(mobile_analysis_no_geo['tests_q1_per_100k_pop'] > thr_test_mob) & (mobile_analysis_no_geo['tests_q3_per_100k_pop'] > thr_test_mob)].copy()
print("The shape of the fixed base excluding municipalities with fewer than", thr_test_fix, "tests per 100k inhabitants:", fixed_analysis_no_geo_thr.shape)
#display(fixed_analysis_no_geo_thr.head())
print("The shape of the mobile base excluding municipalities with fewer than", thr_test_mob, "tests per 100k inhabitants:", mobile_analysis_no_geo_thr.shape)
#display(mobile_analysis_no_geo_dev.head())
Just as my hypothesis predicted, even though the KPI distributions across all municipalities showed no significant differences between Q1 and Q3, a significant group of municipalities shows severe deterioration:
I defined 30% as the threshold for significant KPI deterioration in the following steps of the analysis.
# Deterioration for fixed network base
deterioration_fixed = pd.DataFrame()
deterioration_fixed['Deterioriation_Threshold'] = np.arange(0.1, 0.85, step=0.05)
deterioration_fixed['Municipalities with DL deterioration'] = np.nan
deterioration_fixed['Municipalities with UL deterioration'] = np.nan
deterioration_fixed['Municipalities with Latency deterioration'] = np.nan
deterioration_fixed['Municipalities with deterioration in any KPI beyond threshold'] = np.nan
for index, row in deterioration_fixed.iterrows():
    thr = row['Deterioriation_Threshold']
    # Write back through .at: assigning to the row returned by iterrows() would not modify the DataFrame
    deterioration_fixed.at[index, 'Municipalities with DL deterioration'] = fixed_analysis_no_geo_thr.loc[(fixed_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -thr)].shape[0]
    deterioration_fixed.at[index, 'Municipalities with UL deterioration'] = fixed_analysis_no_geo_thr.loc[(fixed_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -thr)].shape[0]
    deterioration_fixed.at[index, 'Municipalities with Latency deterioration'] = fixed_analysis_no_geo_thr.loc[(fixed_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > thr)].shape[0]
    deterioration_fixed.at[index, 'Municipalities with deterioration in any KPI beyond threshold'] = fixed_analysis_no_geo_thr.loc[(fixed_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -thr)|(fixed_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -thr)|(fixed_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > thr)].shape[0]
deterioration_fixed.head()
# Deterioration for mobile network base
deterioration_mobile = pd.DataFrame()
deterioration_mobile['Deterioriation_Threshold'] = np.arange(0.10, 0.85, step=0.05)
deterioration_mobile['Municipalities with DL deterioration'] = np.nan
deterioration_mobile['Municipalities with UL deterioration'] = np.nan
deterioration_mobile['Municipalities with Latency deterioration'] = np.nan
deterioration_mobile['Municipalities with deterioration in any KPI beyond threshold'] = np.nan
for index, row in deterioration_mobile.iterrows():
    thr = row['Deterioriation_Threshold']
    # Write back through .at: assigning to the row returned by iterrows() would not modify the DataFrame
    deterioration_mobile.at[index, 'Municipalities with DL deterioration'] = mobile_analysis_no_geo_thr.loc[(mobile_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -thr)].shape[0]
    deterioration_mobile.at[index, 'Municipalities with UL deterioration'] = mobile_analysis_no_geo_thr.loc[(mobile_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -thr)].shape[0]
    deterioration_mobile.at[index, 'Municipalities with Latency deterioration'] = mobile_analysis_no_geo_thr.loc[(mobile_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > thr)].shape[0]
    deterioration_mobile.at[index, 'Municipalities with deterioration in any KPI beyond threshold'] = mobile_analysis_no_geo_thr.loc[(mobile_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -thr)|(mobile_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -thr)|(mobile_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > thr)].shape[0]
deterioration_mobile.head()
fig, ax1 = plt.subplots(1,2 , figsize=(23,6))
ax1[0].plot(deterioration_fixed['Deterioriation_Threshold'],deterioration_fixed['Municipalities with DL deterioration'], label= 'Number of municipalities with DL deterioration')
ax1[0].plot(deterioration_fixed['Deterioriation_Threshold'],deterioration_fixed['Municipalities with UL deterioration'], label= 'Number of municipalities with UL deterioration')
ax1[0].plot(deterioration_fixed['Deterioriation_Threshold'],deterioration_fixed['Municipalities with Latency deterioration'], label= 'Number of municipalities with Latency deterioration')
ax1[0].plot(deterioration_fixed['Deterioriation_Threshold'],deterioration_fixed['Municipalities with deterioration in any KPI beyond threshold'], label= 'Number of municipalities with deterioration in any KPI beyond threshold')
ax1[0].set_xticks(np.arange(0.1, 0.85, step=0.05))
ax1[0].set_yticks(np.arange(0, 550, step=50))
ax1[0].set_ylim(0, 550)
ax1[0].legend(loc='upper right')
ax1[0].grid()
ax1[0].set_xlabel('KPI deterioration (1- [KPI Q3 / KPI Q1])')
ax1[0].set_ylabel('Amount of Municipalities')
ax1[0].set_title("Fixed Network - Number of Municipalities vs KPI Deterioration", fontsize = 14)
ax1[1].plot(deterioration_mobile['Deterioriation_Threshold'],deterioration_mobile['Municipalities with DL deterioration'], label= 'Number of municipalities with DL deterioration')
ax1[1].plot(deterioration_mobile['Deterioriation_Threshold'],deterioration_mobile['Municipalities with UL deterioration'], label= 'Number of municipalities with UL deterioration')
ax1[1].plot(deterioration_mobile['Deterioriation_Threshold'],deterioration_mobile['Municipalities with Latency deterioration'], label= 'Number of municipalities with Latency deterioration')
ax1[1].plot(deterioration_mobile['Deterioriation_Threshold'],deterioration_mobile['Municipalities with deterioration in any KPI beyond threshold'], label= 'Number of municipalities with deterioration in any KPI beyond threshold')
ax1[1].set_xticks(np.arange(0.1 , 0.85, step=0.05))
ax1[1].set_yticks(np.arange(0, 550, step=50))
ax1[1].set_ylim(0, 550)
ax1[1].legend(loc='upper right')
ax1[1].grid()
ax1[1].set_xlabel('KPI deterioration (1- [KPI Q3 / KPI Q1])')
ax1[1].set_ylabel('Amount of Municipalities')
ax1[1].set_title("Mobile Network - Number of Municipalities vs KPI Deterioration", fontsize = 14)
kpi_thr = 0.30
fixed_analysis_no_geo_thr['Significant_deterioration'] = np.where((fixed_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -kpi_thr)|(fixed_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -kpi_thr)|(fixed_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > kpi_thr), #Identifies the case to apply to
1, #This is the value that is inserted
0)
mobile_analysis_no_geo_thr['Significant_deterioration'] = np.where((mobile_analysis_no_geo_thr['%_dif_avg_d_q3_q1'] < -kpi_thr)|(mobile_analysis_no_geo_thr['%_dif_avg_u_q3_q1'] < -kpi_thr)|(mobile_analysis_no_geo_thr['%_dif_avg_lat_q3_q1'] > kpi_thr), #Identifies the case to apply to
1, #This is the value that is inserted
0)
display(mobile_analysis_no_geo_thr.head(2))
display(fixed_analysis_no_geo_thr.head(2))
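As a sanity check on the flag logic, here is the same `np.where` condition applied to a toy frame (hypothetical values): a municipality is flagged 1 when any KPI moves beyond the threshold in the bad direction, i.e. download or upload speed down by more than 30%, or latency up by more than 30%.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'%_dif_avg_d_q3_q1':   [-0.40,  0.10],   # DL: -40% vs +10%
                    '%_dif_avg_u_q3_q1':   [ 0.00, -0.10],   # UL: flat vs -10%
                    '%_dif_avg_lat_q3_q1': [ 0.00,  0.20]})  # latency: flat vs +20%
kpi_thr = 0.30
flag = np.where((toy['%_dif_avg_d_q3_q1'] < -kpi_thr) |
                (toy['%_dif_avg_u_q3_q1'] < -kpi_thr) |
                (toy['%_dif_avg_lat_q3_q1'] > kpi_thr), 1, 0)
print(flag.tolist())  # → [1, 0]: only the first municipality breaches a threshold
```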
fixed_consolidated_geo_thr = fixed_consolidated_geo.merge(fixed_analysis_no_geo_thr[['DPTO_MPIO','Significant_deterioration']], how = 'inner', on = 'DPTO_MPIO')
print(fixed_consolidated_geo_thr.shape)
display(fixed_consolidated_geo_thr.head(1))
mobile_consolidated_geo_thr = mobile_consolidated_geo.merge(mobile_analysis_no_geo_thr[['DPTO_MPIO','Significant_deterioration']], how = 'inner', on = 'DPTO_MPIO')
print(mobile_consolidated_geo_thr.shape)
display(mobile_consolidated_geo_thr.head(1))
fig_dims = (20, 10)
fig, ax = plt.subplots(1,2, figsize=fig_dims)
df_plot1 = fixed_consolidated_geo_thr.copy()
df_plot2 = mobile_consolidated_geo_thr.copy()
variable1 = 'Significant_deterioration'
scheme_1 = mpclas.Quantiles(df_plot1[variable1], k=2)
scheme_2 = mpclas.Quantiles(df_plot2[variable1], k=2)
gpl.choropleth(df_plot1, cmap="Reds", hue=df_plot1[variable1], scheme=scheme_1, legend = True, ax = ax[0])
gpl.choropleth(df_plot2, cmap="Reds", hue=df_plot2[variable1], scheme=scheme_2, legend = True, ax = ax[1])
fig.suptitle("Municipalities with KPI deterioration", fontsize=16)
ax[0].set_title("Municipalities with any Fixed KPI deteriorated by 30% or more")
ax[1].set_title("Municipalities with any Mobile KPI deteriorated by 30% or more")
fig.tight_layout()
fig.subplots_adjust(top=0.94)
#fixed_analysis_no_geo_thr.to_csv("gs://imperial_analytics_project/backup/fixed_analysis_no_geo_thr.csv")
#mobile_analysis_no_geo_thr.to_csv("gs://imperial_analytics_project/backup/mobile_analysis_no_geo_thr.csv")
fixed_analysis_no_geo_thr= pd.read_csv("gs://imperial_analytics_project/backup/fixed_analysis_no_geo_thr.csv").drop(['Unnamed: 0'], axis=1)
mobile_analysis_no_geo_thr= pd.read_csv("gs://imperial_analytics_project/backup/mobile_analysis_no_geo_thr.csv").drop(['Unnamed: 0'], axis=1)
features_general = ['MPIO_CCDGO', 'POPULATION_Q3', 'POP_DENSITY', 'GDP_CAPITA', '%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_vendor = ['%_CLARO_Q3', '%_TIGO_Q3', '%_MOVISTAR_Q3', '%_ETB_Q3', '%_EMCALI_Q3', '%_DIRECTV_Q3', '%_EDATEL_Q3', '%_HV_MULT_Q3', '%_HUGHES_Q3', '%_AZTECA_Q3', '%_OTHERS', '%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_tech = ['FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', 'avg_d_kbps_q1', 'avg_u_kbps_q1', 'avg_lat_ms_q1','%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_tech_dif = ['dif_q3_q1_fix_access', 'dif_q3_q1_FTTH', 'dif_q3_q1_FTTX', 'dif_q3_q1_CABLE', 'dif_q3_q1_FWA', 'dif_q3_q1_COPPER','%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
print("Total count for municipalities with (1) and without (0) significant deterioration: \n", fixed_analysis_no_geo_thr['Significant_deterioration'].value_counts())
print("Average values per feature for municipalities with (1) and without (0) significant deterioration:")
display(fixed_analysis_no_geo_thr[features_general].pivot_table(index = 'Significant_deterioration', values=features_general, aggfunc='mean'))
display(fixed_analysis_no_geo_thr[features_vendor].pivot_table(index = 'Significant_deterioration', values=features_vendor, aggfunc='mean'))
display(fixed_analysis_no_geo_thr[features_tech].pivot_table(index = 'Significant_deterioration', values=features_tech, aggfunc='mean'))
display(fixed_analysis_no_geo_thr[features_tech_dif].pivot_table(index = 'Significant_deterioration', values=features_tech_dif, aggfunc='mean'))
#fixed_analysis_no_geo_thr[features_tech_dif].head()
plt.figure(figsize=(7,7))
sns.heatmap(fixed_analysis_no_geo_thr[features_general].corr(),annot=True)
plt.plot()
plt.figure(figsize=(12,12))
sns.heatmap(fixed_analysis_no_geo_thr[features_vendor].corr(),annot=True)
plt.plot()
plt.figure(figsize=(12,12))
sns.heatmap(fixed_analysis_no_geo_thr[features_tech].corr(),annot=True)
plt.plot()
plt.figure(figsize=(12,12))
sns.heatmap(fixed_analysis_no_geo_thr[features_tech_dif].corr(),annot=True)
plt.plot()
# Plot correlation matrix with the most relevant features:
features_sum_fixed = ['POP_DENSITY', 'GDP_CAPITA', '%_CLARO_Q3' , '%_TIGO_Q3', '%_MOVISTAR_Q3' , '%_HUGHES_Q3', '%_AZTECA_Q3','FIX_INT_PENETRATION_Q3', '%_CABLE_Q3', '%_FWA_Q3', 'Significant_deterioration']
plt.figure(figsize=(12,12))
sns.heatmap(fixed_analysis_no_geo_thr[features_sum_fixed].corr(),annot=True)
plt.plot()
features_general_vendor1 = ['MPIO_CCDGO', 'POP_DENSITY', 'GDP_CAPITA', '%_CLARO_Q3', '%_TIGO_Q3', '%_MOVISTAR_Q3', '%_ETB_Q3', 'Significant_deterioration']
features_general_vendor2 = ['MPIO_CCDGO', 'POP_DENSITY', 'GDP_CAPITA', '%_EMCALI_Q3', '%_DIRECTV_Q3','%_EDATEL_Q3', '%_HV_MULT_Q3', 'Significant_deterioration']
features_general_vendor3 = ['MPIO_CCDGO', 'POP_DENSITY', 'GDP_CAPITA', '%_HUGHES_Q3', '%_AZTECA_Q3', '%_OTHERS','Significant_deterioration']
sns.pairplot(fixed_analysis_no_geo_thr[features_general_vendor1], hue = 'Significant_deterioration')
sns.pairplot(fixed_analysis_no_geo_thr[features_general_vendor2], hue = 'Significant_deterioration')
sns.pairplot(fixed_analysis_no_geo_thr[features_general_vendor3], hue = 'Significant_deterioration')
features_tech_vendor1 = ['FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3','%_CLARO_Q3', '%_TIGO_Q3', '%_MOVISTAR_Q3', '%_ETB_Q3', 'Significant_deterioration']
features_tech_vendor2 = ['FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', '%_EDATEL_Q3', '%_HV_MULT_Q3', 'Significant_deterioration']
features_tech_vendor3 = ['FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', '%_HUGHES_Q3', '%_AZTECA_Q3', '%_OTHERS','Significant_deterioration']
sns.pairplot(fixed_analysis_no_geo_thr[features_tech_vendor1], hue = 'Significant_deterioration')
sns.pairplot(fixed_analysis_no_geo_thr[features_tech_vendor2], hue = 'Significant_deterioration')
sns.pairplot(fixed_analysis_no_geo_thr[features_tech_vendor3], hue = 'Significant_deterioration')
features_general_tech = ['POP_DENSITY', 'GDP_CAPITA', 'FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', 'Significant_deterioration']
sns.pairplot(fixed_analysis_no_geo_thr[features_general_tech], hue = 'Significant_deterioration')
features_tech_dif = ['dif_q3_q1_fix_access', 'dif_q3_q1_FTTH', 'dif_q3_q1_FTTX', 'dif_q3_q1_CABLE', 'dif_q3_q1_FWA', 'dif_q3_q1_COPPER', 'Significant_deterioration']
sns.pairplot(fixed_analysis_no_geo_thr[features_tech_dif], hue = 'Significant_deterioration')
features_sum_fixed = ['POP_DENSITY', 'GDP_CAPITA', '%_CLARO_Q3' , '%_TIGO_Q3', '%_HUGHES_Q3', '%_AZTECA_Q3','FIX_INT_PENETRATION_Q3', '%_CABLE_Q3', '%_FWA_Q3', 'Significant_deterioration']
sns.pairplot(fixed_analysis_no_geo_thr[features_sum_fixed], hue = 'Significant_deterioration')
features_general_mob = ['MPIO_CCDGO', 'POPULATION_Q3', 'POP_DENSITY', 'GDP_CAPITA', '%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_vendor_mob = ['%_M_CLARO_Q3', '%_M_TIGO_Q3', '%_M_MOVISTAR_Q3', '%_M_ETB_Q3', '%_M_AVANTEL_Q3', '%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_tech_mob = ['%_LTE_Q3','FIX_INT_PENETRATION_Q3', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', 'avg_d_kbps_q1', 'avg_u_kbps_q1', 'avg_lat_ms_q1','%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
features_tech_mob_dif = ['dif_q3_q1_%_LTE','dif_q3_q1_COBERTURA LTE', 'dif_q3_q1_COBERTURA 3G', 'dif_q3_q1_COBERTURA 2G','%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'Significant_deterioration']
print("Total count for municipalities with (1) and without (0) significant deterioration: \n", mobile_analysis_no_geo_thr['Significant_deterioration'].value_counts())
print("Average values per feature for municipalities with (1) and without (0) significant deterioration:")
display(mobile_analysis_no_geo_thr[features_general_mob].pivot_table(index = 'Significant_deterioration', values=features_general_mob, aggfunc='mean'))
display(mobile_analysis_no_geo_thr[features_vendor_mob].pivot_table(index = 'Significant_deterioration', values=features_vendor_mob, aggfunc='mean'))
display(mobile_analysis_no_geo_thr[features_tech_mob].pivot_table(index = 'Significant_deterioration', values=features_tech_mob, aggfunc='mean'))
display(mobile_analysis_no_geo_thr[features_tech_mob_dif].pivot_table(index = 'Significant_deterioration', values=features_tech_mob_dif, aggfunc='mean'))
plt.figure(figsize=(7,7))
sns.heatmap(mobile_analysis_no_geo_thr[features_general_mob].corr(),annot=True)
plt.plot()
plt.figure(figsize=(7,7))
sns.heatmap(mobile_analysis_no_geo_thr[features_vendor_mob].corr(),annot=True)
plt.plot()
plt.figure(figsize=(12,12))
sns.heatmap(mobile_analysis_no_geo_thr[features_tech_mob].corr(),annot=True)
plt.plot()
plt.figure(figsize=(7,7))
sns.heatmap(mobile_analysis_no_geo_thr[features_tech_mob_dif].corr(),annot=True)
plt.plot()
features_sum_mob = ['POP_DENSITY', 'COBERTURA 2G_Q3', 'COBERTURA 3G_Q3', '%_LTE_Q3','FIX_INT_PENETRATION_Q3', '%_CABLE_Q3', '%_FWA_Q3','Significant_deterioration' ]
plt.figure(figsize=(10,10))
sns.heatmap(mobile_analysis_no_geo_thr[features_sum_mob].corr(),annot=True)
plt.plot()
features_general_vendor = ['MPIO_CCDGO', 'POP_DENSITY', 'GDP_CAPITA', '%_M_CLARO_Q3', '%_M_TIGO_Q3', '%_M_MOVISTAR_Q3', '%_M_ETB_Q3', '%_M_AVANTEL_Q3','Significant_deterioration']
sns.pairplot(mobile_analysis_no_geo_thr[features_general_vendor], hue = 'Significant_deterioration')
features_tech_vendor_mob = ['%_LTE_Q3','FIX_INT_PENETRATION_Q3', '%_M_CLARO_Q3', '%_M_TIGO_Q3', '%_M_MOVISTAR_Q3', '%_M_ETB_Q3', '%_M_AVANTEL_Q3','Significant_deterioration']
sns.pairplot(mobile_analysis_no_geo_thr[features_tech_vendor_mob], hue = 'Significant_deterioration')
features_general_tech_mob = ['MPIO_CCDGO', 'POP_DENSITY', 'GDP_CAPITA', 'FIX_INT_PENETRATION_Q3', '%_LTE_Q3', 'Significant_deterioration']
sns.pairplot(mobile_analysis_no_geo_thr[features_general_tech_mob], hue = 'Significant_deterioration')
sns.pairplot(mobile_analysis_no_geo_thr[features_sum_mob], hue = 'Significant_deterioration')
#pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)
fixed_analysis_no_geo_thr= pd.read_csv("gs://imperial_analytics_project/backup/fixed_analysis_no_geo_thr.csv").drop(['Unnamed: 0'], axis=1)
mobile_analysis_no_geo_thr= pd.read_csv("gs://imperial_analytics_project/backup/mobile_analysis_no_geo_thr.csv").drop(['Unnamed: 0'], axis=1)
fixed_class = fixed_analysis_no_geo_thr[['avg_d_kbps_q1',
'avg_u_kbps_q1',
'avg_lat_ms_q1',
'POPULATION',
'NBR_TECH_ACCESS_Q3',
'AVG_OFFER_DL_FIX_ACC_Q3',
'AVG_OFFER_UL_FIX_ACC_Q3',
'dif_q3_q1_fix_access',
'dif_q3_q1_FTTH',
'dif_q3_q1_FTTX',
'dif_q3_q1_CABLE',
'dif_q3_q1_FWA',
'dif_q3_q1_COPPER',
'POP_DENSITY',
'GDP_CAPITA']].reset_index().drop('index', axis = 1)
print("Fixed classification DataFrame created, shape:", fixed_class.shape)
fixed_class.describe()
fixed_class_scale = pd.DataFrame()
#original_columns = fixed_class.columns
for column in fixed_class:
    log_column = 'log_' + str(column)
    if fixed_class[column].min() > 0:
        fixed_class_scale[log_column] = np.log1p(fixed_class[column])
    else:
        # Shift so the minimum maps to zero before applying log1p
        fixed_class_scale[log_column] = np.log1p(fixed_class[column] - fixed_class[column].min())
fixed_class_scale = fixed_class_scale.join(fixed_analysis_no_geo_thr[['%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'FIX_INT_PENETRATION', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', '%_CLARO_Q3', '%_TIGO_Q3', '%_MOVISTAR_Q3', '%_ETB_Q3', '%_EMCALI_Q3', '%_DIRECTV_Q3', '%_EDATEL_Q3', '%_HV_MULT_Q3', '%_HUGHES_Q3', '%_AZTECA_Q3', '%_OTHERS_Q3']].reset_index().drop("index", axis = 1))
fixed_class_scale.describe()
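The shift-then-`log1p` pattern in the loop above can be seen on a toy series with negative values (hypothetical numbers): shifting by the minimum guarantees a non-negative argument, and the minimum itself maps to `log1p(0) = 0`.

```python
import numpy as np
import pandas as pd

s = pd.Series([-5.0, 0.0, 10.0])        # a feature that dips below zero
shifted = np.log1p(s - s.min())         # same transform as the else-branch above
print(shifted.round(4).tolist())        # → [0.0, 1.7918, 2.7726]
```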
#scaler = preprocessing.StandardScaler().fit(fixed_class_scale)
#scaler = preprocessing.QuantileTransformer().fit(fixed_class_scale)
scaler = preprocessing.MinMaxScaler().fit(fixed_class_scale)
# Rebuild the DataFrame keeping the original column names (the right-hand side is
# evaluated before reassignment, so .columns still refers to the pre-scaling frame)
fixed_class_scale = pd.DataFrame(data = scaler.transform(fixed_class_scale), columns = fixed_class_scale.columns)
fixed_class_scale.describe()
mobile_class = mobile_analysis_no_geo_thr[['avg_d_kbps_q1',
'avg_u_kbps_q1',
'avg_lat_ms_q1',
'POPULATION',
'FIX_INT_PENETRATION',
'NBR_TECH_ACCESS_Q3',
'COBERTURA 2G_Q3',
'COBERTURA 3G_Q3',
'COBERTURA HSPA+, HSPA+DC_Q3',
'COBERTUTA LTE_Q3',
'dif_q3_q1_COBERTURA LTE',
'dif_q3_q1_COBERTURA 3G',
'dif_q3_q1_COBERTURA 2G',
'POP_DENSITY',
'GDP_CAPITA']].reset_index().drop('index', axis = 1)
print("Mobile classification DataFrame created, shape:", mobile_class.shape)
mobile_class.describe()
mobile_class_scale = pd.DataFrame()
for column in mobile_class:
    log_column = 'log_' + str(column)
    if mobile_class[column].min() > 0:
        mobile_class_scale[log_column] = np.log1p(mobile_class[column])
    else:
        # Shift so the minimum maps to zero before applying log1p
        mobile_class_scale[log_column] = np.log1p(mobile_class[column] - mobile_class[column].min())
mobile_class_scale = mobile_class_scale.join(mobile_analysis_no_geo_thr[['%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', '%_LTE_Q3', '%_M_AVANTEL_Q3', '%_M_TIGO_Q3', '%_M_MOVISTAR_Q3', '%_M_CLARO_Q3', '%_M_ETB_Q3', 'dif_q3_q1_%_LTE']].reset_index().drop("index", axis = 1))
mobile_class_scale.describe()
#scaler = preprocessing.StandardScaler().fit(mobile_class_scale)
#scaler = preprocessing.QuantileTransformer().fit(mobile_class_scale)
scaler = preprocessing.MinMaxScaler().fit(mobile_class_scale)
# Rebuild the DataFrame keeping the original column names (the right-hand side is
# evaluated before reassignment, so .columns still refers to the pre-scaling frame)
mobile_class_scale = pd.DataFrame(data = scaler.transform(mobile_class_scale), columns = mobile_class_scale.columns)
mobile_class_scale.describe()
fixed_class_scale.to_csv("gs://imperial_analytics_project/backup/fixed_class_scale.csv")
mobile_class_scale.to_csv("gs://imperial_analytics_project/backup/mobile_class_scale.csv")
#fixed_class_scale = pd.read_csv("gs://imperial_analytics_project/backup/fixed_class_scale.csv").drop(['Unnamed: 0'], axis=1)
#mobile_class_scale = pd.read_csv("gs://imperial_analytics_project/backup/mobile_class_scale.csv").drop(['Unnamed: 0'], axis=1)
fixed_class.to_csv("gs://imperial_analytics_project/backup/fixed_class.csv")
mobile_class.to_csv("gs://imperial_analytics_project/backup/mobile_class.csv")
#fixed_class = pd.read_csv("gs://imperial_analytics_project/backup/fixed_class.csv").drop(['Unnamed: 0'], axis=1)
#mobile_class = pd.read_csv("gs://imperial_analytics_project/backup/mobile_class.csv").drop(['Unnamed: 0'], axis=1)
# Elbow Curve
x_fixed_class_scale = np.array(fixed_class_scale)
Nc = range(1, 40)
kmeans = [KMeans(n_clusters=i) for i in Nc]
kmeans
score = [kmeans[i].fit(fixed_class_scale).score(x_fixed_class_scale) for i in range(len(kmeans))]
score
plt.plot(Nc,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
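A note on reading the curve: scikit-learn's `KMeans.score` returns the *negative* inertia (within-cluster sum of squared distances), so the elbow curve rises toward zero as clusters are added. A minimal check on synthetic data (toy stand-in for the scaled features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy stand-in for the scaled matrix
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# score() on the training data equals minus the fitted inertia
print(np.isclose(km.score(X), -km.inertia_))  # → True
```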
kmeans = KMeans(n_clusters=7).fit(x_fixed_class_scale)
centroids = kmeans.cluster_centers_
#print(centroids)
labels_fixed = kmeans.predict(x_fixed_class_scale)
#print(labels_fixed)
print(labels_fixed.size)
df_labels_fixed = pd.DataFrame(labels_fixed).rename(columns={0 : 'CLUSTER'})
df_labels_fixed['CLUSTER'].value_counts()
fixed_class_clust = fixed_analysis_no_geo_thr[['DPTO_MPIO', 'MPIO_CNMBR', 'DPTO_CNMBR', '%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1', 'FIX_INT_PENETRATION', '%_FTTH_Q3', '%_FTTX_Q3', '%_CABLE_Q3', '%_FWA_Q3', '%_COPPER_Q3', '%_CLARO_Q3', '%_TIGO_Q3', '%_MOVISTAR_Q3', '%_ETB_Q3', '%_EMCALI_Q3', '%_DIRECTV_Q3', '%_EDATEL_Q3', '%_HV_MULT_Q3', '%_HUGHES_Q3', '%_AZTECA_Q3', '%_OTHERS_Q3', 'Significant_deterioration']].reset_index().drop("index", axis = 1)
fixed_class_clust = fixed_class_clust.join(fixed_class).join(df_labels_fixed)
features_fixed = fixed_class_clust.columns.tolist()
features_fixed = [i for i in features_fixed if i not in ('DPTO_MPIO', 'MPIO_CNMBR', 'DPTO_CNMBR')]
#display(fixed_class_clust.loc[(fixed_class_clust['CLUSTER'] == 2)].head())
display(fixed_class_clust.head())
fixed_class_clust['CLUSTER'].value_counts()
fixed_class_clust[features_fixed].pivot_table(index = 'CLUSTER', values=features_fixed, aggfunc='mean')
# Elbow Curve (mobile): same procedure as for the fixed dataset
x_mobile_class_scale = np.array(mobile_class_scale)
Nc = range(1, 40)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(x_mobile_class_scale).score(x_mobile_class_scale) for i in range(len(kmeans))]
plt.plot(Nc, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters=6).fit(x_mobile_class_scale)
centroids = kmeans.cluster_centers_
#print(centroids)
labels_mobile = kmeans.predict(x_mobile_class_scale)
#print(labels_mobile)
labels_mobile.size
df_labels_mobile = pd.DataFrame(labels_mobile).rename(columns={0 : 'CLUSTER'})
df_labels_mobile['CLUSTER'].value_counts()
mobile_class_clust = mobile_analysis_no_geo_thr[['DPTO_MPIO', 'MPIO_CNMBR', 'DPTO_CNMBR', 'Significant_deterioration', 'avg_d_kbps_q3', 'avg_lat_ms_q3', 'avg_u_kbps_q3']].reset_index().drop('index', axis = 1)
mobile_class_clust = mobile_class_clust.join(mobile_class).join(df_labels_mobile)
features_mobile = mobile_class_clust.columns.tolist()
features_mobile = [i for i in features_mobile if i not in ('DPTO_MPIO', 'MPIO_CNMBR', 'DPTO_CNMBR')]
#display(mobile_class_clust.loc[(mobile_class_clust['CLUSTER'] == 1)].head())
display(mobile_class_clust.head())
mobile_class_clust['CLUSTER'].value_counts()
mobile_class_clust[features_mobile].pivot_table(index = 'CLUSTER', values=features_mobile, aggfunc='mean')
fixed_scale_no_diff = fixed_class_scale.drop(['%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1'], axis = 1)
display(fixed_scale_no_diff.head())
fixed_scale_no_diff.shape
# Recursive feature elimination
lr_fixed = LogisticRegression(max_iter=1000)
rfe_f = RFE(lr_fixed, n_features_to_select=int(round(fixed_scale_no_diff.shape[1] * 0.70)))
rfe_f = rfe_f.fit(fixed_scale_no_diff, np.array(fixed_analysis_no_geo_thr['Significant_deterioration']))
print(rfe_f.support_)
#print(rfe_f.ranking_)
# Filter features according to RFE result
rfe_features = np.array(fixed_scale_no_diff.columns.tolist())[rfe_f.support_].tolist()
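For reference, the masking pattern used here relies on `support_` being a boolean array over the columns. A self-contained sketch on synthetic data (names and sizes are illustrative, not the project's):

```python
# RFE keeps exactly n_features_to_select columns; support_ is the mask.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)
rfe_demo = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=7).fit(X_syn, y_syn)
print(rfe_demo.support_)  # boolean mask with 7 True entries
```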
onehot_cluster_fixed = pd.get_dummies(df_labels_fixed['CLUSTER'], prefix = "CLUSTER", prefix_sep="_")
X_fixed = fixed_scale_no_diff[rfe_features].join(onehot_cluster_fixed)
y_fixed = np.array(fixed_analysis_no_geo_thr['Significant_deterioration'])
X_fixed.head()
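The cluster label is one-hot encoded before being joined to the feature matrix, so the classifier sees one binary column per cluster rather than an ordinal code. A toy sketch of that step (illustrative values, not the project's data):

```python
# pd.get_dummies turns one categorical column into prefixed indicator columns.
import pandas as pd

labels_demo = pd.DataFrame({'CLUSTER': [0, 2, 1, 2]})
onehot_demo = pd.get_dummies(labels_demo['CLUSTER'], prefix="CLUSTER", prefix_sep="_")
print(onehot_demo.columns.tolist())
```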
# Separate between train and test and fit model
X_train, X_test, y_train, y_test = train_test_split(X_fixed, y_fixed, random_state=10)
logr_fixed = LogisticRegression(max_iter=1000, solver= 'liblinear', warm_start = True)
logr_fixed.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = logr_fixed.predict(X_test)
print('Accuracy of logistic regression classifier on test set is:', round(logr_fixed.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
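As a sanity check on how these outputs relate: the accuracy reported above is simply the diagonal of the confusion matrix over its total. A toy sketch (illustrative labels, not the project's results):

```python
# Accuracy == trace(confusion matrix) / total predictions.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([0, 0, 1, 1, 1, 0])
y_hat_demo  = np.array([0, 1, 1, 1, 0, 0])
cm_demo = confusion_matrix(y_true_demo, y_hat_demo)
acc_demo = np.trace(cm_demo) / cm_demo.sum()
print(acc_demo)
```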
df_coef = pd.DataFrame(logr_fixed.coef_).T.rename(columns = {0: 'coefficient'})
pd.set_option('display.max_rows', 34)
fixed_logreg_coefficient_matrix = pd.DataFrame(X_fixed.columns).rename(columns = {0: 'features'}).join(df_coef)
fixed_logreg_coefficient_matrix['abs_coef'] = fixed_logreg_coefficient_matrix['coefficient'].abs()
fixed_logreg_coefficient_matrix.sort_values(by= ['abs_coef'], ascending = False).head(10)
mobile_scale_no_diff = mobile_class_scale.drop(['%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1'], axis = 1)
display(mobile_scale_no_diff.head())
mobile_scale_no_diff.shape
# Recursive feature elimination
lr_mobile = LogisticRegression(max_iter = 1000)
rfe_m = RFE(lr_mobile, n_features_to_select=int(round(mobile_scale_no_diff.shape[1] * 0.70)))
rfe_m = rfe_m.fit(mobile_scale_no_diff, mobile_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_m.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(mobile_scale_no_diff.columns.tolist())[rfe_m.support_].tolist()
onehot_cluster_mobile = pd.get_dummies(df_labels_mobile['CLUSTER'], prefix = "CLUSTER", prefix_sep="_")
X_mobile = mobile_scale_no_diff[rfe_features].join(onehot_cluster_mobile)
y_mobile = mobile_analysis_no_geo_thr['Significant_deterioration']
X_mobile.head()
# Separate between train and test and fit model
X_train, X_test, y_train, y_test = train_test_split(X_mobile, y_mobile, random_state=0)
logr_mobile = LogisticRegression(max_iter = 1000, warm_start = True)
logr_mobile.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = logr_mobile.predict(X_test)
print('Accuracy of logistic regression classifier on test set is:', round(logr_mobile.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
df_coef = pd.DataFrame(logr_mobile.coef_).T.rename(columns = {0: 'coefficient'})
pd.set_option('display.max_rows', 34)
mobile_logreg_coefficient_matrix = pd.DataFrame(X_mobile.columns).rename(columns = {0: 'features'}).join(df_coef)
mobile_logreg_coefficient_matrix['abs_coef'] = mobile_logreg_coefficient_matrix['coefficient'].abs()
mobile_logreg_coefficient_matrix.sort_values(by= ['abs_coef'], ascending = False).head(10)
# Recursive feature elimination
fixed_scale_no_diff = fixed_class_scale.drop(['%_dif_avg_d_q3_q1', '%_dif_avg_u_q3_q1', '%_dif_avg_lat_q3_q1'], axis = 1)
dt_fixed = DecisionTreeClassifier()
rfe_f = RFE(dt_fixed, n_features_to_select=int(round(fixed_scale_no_diff.shape[1] * 0.7)))
rfe_f = rfe_f.fit(fixed_scale_no_diff, fixed_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_f.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(fixed_scale_no_diff.columns.tolist())[rfe_f.support_].tolist()
X_fixed = fixed_scale_no_diff[rfe_features].join(onehot_cluster_fixed)
y_fixed = fixed_analysis_no_geo_thr['Significant_deterioration']
X_fixed.head()
X_train, X_test, y_train, y_test = train_test_split(X_fixed, y_fixed, random_state=10)
dtc_fixed = DecisionTreeClassifier()
dtc_fixed.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = dtc_fixed.predict(X_test)
print('Accuracy of decision tree classifier on test set is:', round(dtc_fixed.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Now let's look at how important each input feature is in predicting Significant_deterioration
features = X_fixed.columns.tolist()
prediction_importance = list(dtc_fixed.feature_importances_)
input_importance = [(feature, importance) for feature, importance in zip(features, prediction_importance)]
pd.DataFrame(input_importance).rename(columns={0: 'Feature', 1: 'Feature Importance'}).sort_values(by=['Feature Importance'], ascending = False).head(10)
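When reading this table, it helps to remember that scikit-learn's impurity-based feature importances are normalised to sum to 1, so each value is a share of the total predictive contribution. A toy sketch (synthetic data, not the project's):

```python
# feature_importances_ of a fitted tree sums to 1 across all features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=1)
imp = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy).feature_importances_
print(imp.sum())
```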
fig = plt.figure(figsize=(10,10))
_ = tree.plot_tree(dtc_fixed,
feature_names=features,
filled=True)
rf_fixed = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt')  # 'auto' was removed in newer scikit-learn; 'sqrt' is the classifier equivalent
rfe_f = RFE(rf_fixed, n_features_to_select=int(fixed_scale_no_diff.shape[1]))  # keep all features
rfe_f = rfe_f.fit(fixed_scale_no_diff, fixed_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_f.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(fixed_scale_no_diff.columns.tolist())[rfe_f.support_].tolist()
X_fixed = fixed_scale_no_diff[rfe_features].join(onehot_cluster_fixed)
y_fixed = fixed_analysis_no_geo_thr['Significant_deterioration']
X_fixed.head()
X_train, X_test, y_train, y_test = train_test_split(X_fixed, y_fixed, random_state=10)
rfc_fixed = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt')
rfc_fixed.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = rfc_fixed.predict(X_test)
print('Accuracy of random forest classifier on test set is:', round(rfc_fixed.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
features = X_fixed.columns.tolist()
prediction_importance = list(rfc_fixed.feature_importances_)
input_importance = [(feature, importance) for feature, importance in zip(features, prediction_importance)]
pd.DataFrame(input_importance).rename(columns={0: 'Feature', 1: 'Feature Importance'}).sort_values(by=['Feature Importance'], ascending = False).head(10)
# Recursive feature elimination
dt_mobile = DecisionTreeClassifier()
rfe_m = RFE(dt_mobile, n_features_to_select=int(round(mobile_scale_no_diff.shape[1] * 0.7)))
rfe_m = rfe_m.fit(mobile_scale_no_diff, mobile_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_m.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(mobile_scale_no_diff.columns.tolist())[rfe_m.support_].tolist()
X_mobile = mobile_scale_no_diff[rfe_features].join(onehot_cluster_mobile)
y_mobile = mobile_analysis_no_geo_thr['Significant_deterioration']
X_mobile.head()
X_train, X_test, y_train, y_test = train_test_split(X_mobile, y_mobile, random_state=0)
dtc_mobile = DecisionTreeClassifier()
dtc_mobile.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = dtc_mobile.predict(X_test)
print('Accuracy of decision tree classifier on test set is:', round(dtc_mobile.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
features = X_mobile.columns.tolist()
prediction_importance = list(dtc_mobile.feature_importances_)
input_importance = [(feature, importance) for feature, importance in zip(features, prediction_importance)]
pd.DataFrame(input_importance).rename(columns={0: 'Feature', 1: 'Feature Importance'}).sort_values(by=['Feature Importance'], ascending = False).head(10)
rf_mobile = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt')  # 'auto' was removed in newer scikit-learn; 'sqrt' is the classifier equivalent
rfe_m = RFE(rf_mobile, n_features_to_select=int(mobile_scale_no_diff.shape[1]))  # keep all features
rfe_m = rfe_m.fit(mobile_scale_no_diff, mobile_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_m.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(mobile_scale_no_diff.columns.tolist())[rfe_m.support_].tolist()
X_mobile = mobile_scale_no_diff[rfe_features].join(onehot_cluster_mobile)
y_mobile = mobile_analysis_no_geo_thr['Significant_deterioration']
X_mobile.head()
X_train, X_test, y_train, y_test = train_test_split(X_mobile, y_mobile, random_state=0)
rfc_mobile = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt')
rfc_mobile.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = rfc_mobile.predict(X_test)
print('Accuracy of random forest classifier on test set is:', round(rfc_mobile.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
features = X_mobile.columns.tolist()
prediction_importance = list(rfc_mobile.feature_importances_)
input_importance = [(feature, importance) for feature, importance in zip(features, prediction_importance)]
pd.DataFrame(input_importance).rename(columns={0: 'Feature', 1: 'Feature Importance'}).sort_values(by=['Feature Importance'], ascending = False).head(10)
# Recursive feature elimination
#svm_fixed = svm.LinearSVC(max_iter=10000)
svm_fixed = svm.SVC(kernel = 'linear')
rfe_f = RFE(svm_fixed, n_features_to_select=int(round(fixed_scale_no_diff.shape[1] * 0.70)))
rfe_f = rfe_f.fit(fixed_scale_no_diff, fixed_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_f.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(fixed_scale_no_diff.columns.tolist())[rfe_f.support_].tolist()
X_fixed = fixed_scale_no_diff[rfe_features].join(onehot_cluster_fixed)
y_fixed = fixed_analysis_no_geo_thr['Significant_deterioration']
X_fixed.head()
X_train, X_test, y_train, y_test = train_test_split(X_fixed, y_fixed, random_state=3)
#svmc_fixed = svm.LinearSVC(max_iter=10000)
svmc_fixed = svm.SVC(kernel = 'linear')
svmc_fixed.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = svmc_fixed.predict(X_test)
print('Accuracy of support vector machine classifier on test set is:', round(svmc_fixed.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
df_coef = pd.DataFrame(svmc_fixed.coef_).T.rename(columns = {0: 'coefficient'})
pd.set_option('display.max_rows', 34)
svmc_fixed_coefficient_matrix = pd.DataFrame(X_fixed.columns).rename(columns = {0: 'features'}).join(df_coef)
svmc_fixed_coefficient_matrix['abs_coef'] = svmc_fixed_coefficient_matrix['coefficient'].abs()
svmc_fixed_coefficient_matrix.sort_values(by=['abs_coef'], ascending = False).head(10)
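The linear kernel matters here: `coef_` is only defined for linear-kernel SVMs, which is what makes this coefficient table possible. A toy sketch on synthetic data (illustrative, not the project's dataset):

```python
# Linear-kernel SVC exposes one weight per feature; non-linear kernels do not.
from sklearn import svm
from sklearn.datasets import make_classification

Xs, ys = make_classification(n_samples=100, n_features=5, random_state=2)

lin = svm.SVC(kernel='linear').fit(Xs, ys)
print(lin.coef_.shape)  # (1, n_features) for a binary problem

rbf = svm.SVC(kernel='rbf').fit(Xs, ys)
has_coef = hasattr(rbf, 'coef_')  # False: coef_ is unavailable for rbf
print(has_coef)
```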
# Recursive feature elimination
svm_mobile = svm.LinearSVC(max_iter=100000)
rfe_m = RFE(svm_mobile, n_features_to_select=int(round(mobile_scale_no_diff.shape[1] * 0.70)))
rfe_m = rfe_m.fit(mobile_scale_no_diff, mobile_analysis_no_geo_thr['Significant_deterioration'])
print(rfe_m.support_)
#print(rfe_m.ranking_)
# Filter features according to RFE result
rfe_features = np.array(mobile_scale_no_diff.columns.tolist())[rfe_m.support_].tolist()
X_mobile = mobile_scale_no_diff[rfe_features].join(onehot_cluster_mobile)
y_mobile = mobile_analysis_no_geo_thr['Significant_deterioration']
X_mobile.head()
X_train, X_test, y_train, y_test = train_test_split(X_mobile, y_mobile, random_state=0)
svmc_mobile = svm.LinearSVC(max_iter=100000)
svmc_mobile.fit(X_train, y_train)
print("X_train size =", X_train.shape)
print("y_train size =", y_train.shape)
print("X_test size =", X_test.shape)
print("y_test size =", y_test.shape)
y_pred = svmc_mobile.predict(X_test)
print('Accuracy of support vector machine classifier on test set is:', round(svmc_mobile.score(X_test, y_test),2))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
df_coef = pd.DataFrame(svmc_mobile.coef_).T.rename(columns = {0: 'coefficient'})
pd.set_option('display.max_rows', 34)
svmc_mobile_coefficient_matrix = pd.DataFrame(X_mobile.columns).rename(columns = {0: 'features'}).join(df_coef)
svmc_mobile_coefficient_matrix['abs_coef'] = svmc_mobile_coefficient_matrix['coefficient'].abs()
svmc_mobile_coefficient_matrix.sort_values(by=['abs_coef'], ascending = False).head(10)
This document was prepared by Juan David Rodriguez for academic purposes. The sources of information were either public, without usage restrictions, or licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0).
The contents of this document are proprietary to Juan David Rodriguez and are licensed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0).
The contents of this document are provided "as is", without warranties of any kind, either express or implied. THE OWNER SHALL NOT BE RESPONSIBLE IN ANY EVENT FOR ERRORS IN THIS DOCUMENT, or for any loss of data or income, or for any special, incidental, consequential, indirect or direct damages, howsoever caused, that might arise from the use of this document or of any of its contents.